EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT








VOLUME FOUR, NUMBER TWO, SUMMER 


An Inventory of Students’ General Goals in Life. HAROLD B. DUNKEL


Major Strategy Versus Minor Tactics in Merit Administration.
FRED S. BEERS and CECIL R. BROLYER


How Teachers Can Improve Their Tests. Max D. ENGLE- 


Prediction of College Success by Means of Thurstone’s 
Primary Abilities Tests. CHARLES S. GOODMAN


Test Construction in Public Personnel Administration.
DOROTHY C. ADKINS


Relationship Between Kuhlman-Anderson Intelligence Tests
in Grade 1 and Academic Achievement in Grades 3 and 4.
MILDRED M. ALLEN


Measurement Abstracts 


News Notes 


Copyright, 1944, by 
SCIENCE RESEARCH ASSOCIATES 





PRINTED IN THE UNITED STATES OF AMERICA 
THE SCIENCE PRESS PRINTING COMPANY 


LANCASTER, PENNSYLVANIA 

















AN INVENTORY OF STUDENTS’ GENERAL 
GOALS IN LIFE 


HAROLD B. DUNKEL¹


Cooperative Study in General Education 


What basic beliefs now constitute college students’ “philos- 
ophies of life” or “designs for living”? What does the college 
student consider the main goals of his life? For the sake of 
what values does he think that he lives? To what extent does 
his total pattern of values seem to meet certain criteria of a 
“good” design for living? The attempt to answer these and 
similar questions led to the project in philosophy of life and 
religion undertaken by the Cooperative Study in General Education.²
Almost without exception the colleges stated some
such institutional objective as “to aid the student in developing 
a useful and desirable philosophy of life,” and faculty members 
insisted that their institutions did not merely state this ob- 
jective in the catalogues but were sincerely striving by various 
means to aid the student in developing an adequate personal 
philosophy. 

Four possible steps in dealing with the problem of students’ 
philosophies of life can be indicated by the following questions. 
I. Do and should students have philosophies of life? II. If so, 

1 Work on this project has been carried on by faculty members of colleges par- 
ticipating in the Cooperative Study in General Education and by members of the 
central staff working in the field of Humanities, George E. Barton, Jr., Walker H. 
Hill, and the author. Dr. Barton, now Lt. Barton of the U. S. Air Corps, directed 
the original organization of the project and the preparation of the first form of the 
inventory. The complete report on the project, including norms and data on relia- 
bility and validity, is being published this summer. 

2 The Cooperative Study is an organization of approximately a score of colleges 
who attack cooperatively, with the aid of a central staff, their common educational 
problems. For a report of the Study’s work, see Ralph W. Ogan, “The Cooperative 
Study in General Education,” The Educational Record, XXIII (October, 1942), 
692-703. Those interested in greater detail about the development of this project
should consult the Staff News Letter (The Cooperative Study in General Education,
5835 Kimbark Ave., Chicago, Illinois), vol. II, no. 6.




what are these philosophies? III. Once we have ascertained a 
student’s philosophy, do we (the student and the faculty) con- 
sider it satisfactory, and on the basis of what criteria do we 
decide? IV. If a student’s philosophy appears unsatisfactory, 
how can the college assist the student in working out a satis- 
factory view of life? 

Our colleges believed that the answer to the first question 
was “Yes,” that students should have a philosophy. Many 
teachers and institutions had made a practice of having stu- 
dents write essays setting forth their philosophies. Since these 
essays had all the limitations of any verbal statement (they 
are what the student thinks, or wishes to think, or wishes the 
reader to think, is his philosophy), these teachers were of 
course not under the misapprehension that there is a perfect 
correspondence between the way men live, and the way men 
say they would like to live. The hypocrite, the person with 
selfish or anti-social aims which he fears to express, the victim 
of social or economic pressure, all undeniably exist. Those 
undertaking the project believed, however, that it is equally 
incorrect to assert that there is absolutely no relation between 
the way in which people talk about life and the way in which 
they live. The group felt that in many cases the verbal state- 
ment about the kind of life the student was seeking would be a 
useful index, even though not a perfect one, of the kind of life 
the student was living or seeking to live. 

In the case of many students, when there is a discrepancy 
between stated beliefs and conduct, this inconsistency exists 
without the person’s being aware of it. For these students, an 
opportunity to state their beliefs precisely and to examine the 
implications of those beliefs for action may often lead the 
student to harmonize more closely his pattern of beliefs and 
his pattern of living. 

Other students actually do not know what they believe. 
They have never thought much about their way of life or about 
what they believe. Having acquired their schemes of value 
rather unconsciously and at random from various sources, these 
students, when they leave the pattern of life into which they 
were born and grew up, and when on coming to college they 













meet new and perhaps conflicting values represented by new 
environment, new friends, and new experiences, have consider- 
able difficulty in ordering their personal lives for the first time. 

Without help in seeing and resolving some of this conflict 
and confusion, the student may go through much if not all of 
his adult life, a victim rather than a master of these warring 
values. The purpose of the proposed device was, therefore, to 
enable the student to see clearly what he says he believes, to 
compare it with the philosophies of other students, and to ob- 
tain help in overcoming difficulties and conflicts in his own 
position. 

The group did not believe that all these results could be 
satisfactorily produced at the verbal level alone. Even on the 
verbal level, we certainly did not wish that a student should 
work out for himself, once and for all, an immutable philosophy 
of life. We believed, however, that


most teachers . . . will agree that no student should merely drift 
through life, allowing his major decisions and actions to be determined 
for him entirely by circumstances—to be merely the resultant of 
external forces impinging upon him. Each student should bring to 
every important life decision some “sense of values” which is his own
. . . . Of course such a “design for living” can never be completely
explicit . . . . In a brief, oversimplified, presentation it [the idea
of a design] sounds unduly rationalistic—as if each student were ex- 
pected to chart all the details of his life, to be conscious of all the 
reasons for his every act, and to refer all decisions to some grandiose 
notion of life and the universe. We do not mean this.³


Rather the philosophy should be a “working hypothesis” for 
living, to be revised as and if necessary, in the light of new situ- 
ations and insights. Though a satisfactory verbal statement 
of a philosophy of life is not the whole journey, it is at least a 
step in the right direction. Such reasoning as this, then, led 
our group to the belief that students should have a philosophy 
of life (the group’s answer to the question of Step I) and to an 
interest in securing a verbal statement of this philosophy (Step 
II).

A further result of conferences between faculties and staff 
was the decision that some objective means of securing and re- 


3 Staff News Letter, vol. II, no. 6, p. 6.













cording the student’s statement of his philosophy was desirable. 
Although, to be sure, the response in the form of an essay gives 
certain insights which probably cannot be made available 
through objective devices, the existence of the objective in- 
strument would not preclude the use of the essay. 

On the other hand, the objective instrument would have, 
the group felt, several advantages. In the first place, the 
“philosophies” thus recorded would be more exactly and more 
easily comparable. For example, the score of one student could 
then be compared to his own scores at other points in his edu- 
cational career, to the scores of other individuals, and to the 
scores of groups. Similarly, scores of sections, classes, or entire 
student-bodies could be compared, and if desired, judged. 
Likewise the philosophy of a student who, when left to himself, 
deals in airy generalities would have more points of comparison 
with that of a student who plunges into petty minutiae; and 
the statement of the verbally facile student could more mean- 
ingfully be compared with that of one less articulate. 

Furthermore, the records of these philosophies would be 
more concise and more usable. Faculty-members who wished 
to learn about the philosophy of a group of students would no 
longer face the difficult and often impossible task of reading 
through hundreds of pages of student manuscript. Rather, 
the results stated numerically would permit fairly rapid sum- 
marization and comparison. Then finally, instead of existing 
like the essay in a single copy tucked away in some personnel- 
folder or teacher’s files, copies of the results of an objective 
record could be made available for study and consultation, 
easily and inexpensively, to every person concerned with the 
student. For these reasons, then, the preparation of some objective
device of this sort seemed eminently desirable as a first
step. 

Many aspects of the student’s philosophy of life deserved 
study; but, for various reasons which need not be discussed 
here, the group began work by studying the main goals which
students hold. Hence, though other studies were planned to 
supplement the information gained from this device, the first 
instrument which the group undertook to construct was the 
“Inventory of General Goals of Life.” 










First, a list of possible main goals was secured from papers 
written by students and from teachers in the group. In the 
selection of the goals finally included in the inventory, two 
criteria were dominant. The first was that the goals listed,
though couched in student phraseology, should give an ade- 
quate representation of certain of the great historical traditions 
of philosophy and religion. The second was that the list should 
also include goals which, though less common in formal philos- 
ophy and religion, are familiar in our culture or in “cracker- 
barrel philosophy.” Within the limitations imposed by prac- 
tical conditions, we wished each student to find expressions 
which he could consider adequate statements of elements of his 
own point of view. 

The number of possible “philosophic” positions (to use the 
term in its widest possible sense) and the number of statements 
that can be framed to express any single one are, of course, 
enormous. Yet the practical situation demanded that the final 
list be extremely brief. We tried to secure brevity in three 
ways. (1) We allowed certain goals between which distinc- 
tions are customarily made in philosophy and which yet appear 
to be closely related to one another in the minds of most students
to stand together in a single statement.⁴ (2) Certain
goals, such as the attainment of Nirvana, which are familiar in
the history of thought but which are not commonly held by
American college students, were omitted. (3) Certain goals
common in student philosophies but in a subordinate position
were also omitted.⁵

Closely connected with the selection of the goals was the 
question of what technique should be used. Since the purpose
was to get the student to rank the goals in order of their im- 
portance and acceptability to him, the device of “paired com- 
parison” appeared appropriate since this technique facilitates 
accurate and easy ranking of various possibilities. Thus each 

4 An example of this sort of statement of goal is “Peace of mind, contentment, 
stillness of spirit.” 

5 An example is “good health.” Though many students list it as a goal, few if
any students seek health as an end in itself. Rather they consider it a necessary
condition for attaining other goals which they consider more important. Since the
aim of the inventory was to secure some indication of the more dominant goals, these
subordinate goals were omitted.










of the twenty goals is paired once with each of the other nineteen,
and the student is asked to choose one goal of each pair. As a
result of this process, the goal for which the student has the
greatest preference will be selected nineteen times (i.e., every
time it appears) and so will have a “score” of nineteen. The student’s
second choice will be selected eighteen times (a score of
eighteen), being rejected only in the pair where it appears with
the most acceptable goal. The other goals follow with diminishing
scores until the goal which the student finds least acceptable
or rejects most emphatically is reached. In short, the
student who manages to make his choices with perfect consistency
will give to goals “scores” which will rank them in
order.⁶
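The scoring by paired comparison described above can be illustrated with a short sketch. The Python below is not part of the inventory or of the Cooperative Study's procedure. The goal labels and the two hypothetical students are assumptions made purely for illustration. The first student chooses with perfect consistency and so obtains the scores nineteen, eighteen, and so on down to zero. The second chooses in a circle among three goals, which is one way (not necessarily the only one) in which tied scores can arise.

```python
from itertools import combinations


def paired_comparison_scores(goals, prefer):
    """Pair every goal once with every other goal and count how often each is chosen."""
    scores = {goal: 0 for goal in goals}
    for a, b in combinations(goals, 2):
        scores[prefer(a, b)] += 1
    return scores


# Hypothetical labels standing in for the twenty goals of the inventory.
goals = [f"goal_{i:02d}" for i in range(1, 21)]

# Hypothetical consistent student: always prefers the goal listed earlier.
position = {goal: i for i, goal in enumerate(goals)}
consistent = paired_comparison_scores(
    goals, lambda a, b: a if position[a] < position[b] else b
)

num_pairs = len(goals) * (len(goals) - 1) // 2
print(num_pairs)                                   # 190 pairs in all
print(sorted(consistent.values(), reverse=True))   # 19, 18, ..., 1, 0

# Hypothetical inconsistent student: A over B, B over C, but C over A.
circular_choices = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "B"}
tied = paired_comparison_scores(
    ["A", "B", "C"], lambda a, b: circular_choices[(a, b)]
)
print(tied)  # each goal is chosen once, so all three are tied
```

Since one goal is chosen in each of the 190 pairs, the twenty scores of any student must sum to 190, whether or not the choices are consistent.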

In interpreting the inventory, one looks first at the goals 
which are ranked at the head of the list. (The exact number 
of goals considered depends partly on the nature of the par- 
ticular goals concerned, partly on the arrangement of the scores 
in terms of gaps or ties in the ranking.) These goals at the 
head of the list are the statements which most appeal to the 
student. Next the goals at the foot of the list are considered 
since the student’s philosophy as indicated by the goals most 
readily accepted is often further defined and clarified by the 
goals which he rejects (or accepts least readily).⁷

The way in which the inventory records a philosophy of life 
can probably best be indicated by an example. This student,
before taking the inventory, wrote a brief essay stating her 
philosophy. This essay is reproduced here and followed by a 

6 For several reasons which are too complicated for discussion here, perfect 
“consistency” is rare, and ties are common in the rankings made by students. 

7 In interpreting the inventory, usually little weight is given to those goals
which appear in the middle of the list. The reasons for this procedure are both 
philosophical and statistical. 

Statistically, extreme deviations are of course much less likely to occur because 
of pure chance. 

Philosophically, the basic concepts of many positions can be stated by accepting 
relatively few of the statements in this list and by rejecting a few others. For a 
particular philosophical position, many of these goals listed may be irrelevant or 
meaningless. The form of the test does not, however, permit the student to discard 
these goals literally. He must continue to make choices involving them. He tends 
to mark those goals to which he is indifferent somewhat lower than those goals 
which he accepts as statements of his position, yet somewhat higher than those goals
which his position necessarily rejects. In other words, the goals toward which he
is neutral or indifferent tend to appear in the middle of his ranking.










listing of the order in which this student subsequently ranked 
the goals of the inventory. The case is fairly typical in that, 
while the statements in the philosophy are not verbally identical 
with the wording of the goals in the inventory, the general 
nature and tone of the philosophy are reproduced. 


My Philosophy of Life 


I find it difficult to state my philosophy of life. One’s philosophy 
is far from being an immutable thing, and mine is constantly chang- 
ing from contact with the outside world. The best I can do is to 
state the basic principles which are elementally stable. 

I first of all think that life is to be enjoyed. Happiness no matter 
how brief, leaves a lasting impression that lives on from one period to 
the next. Its memory fills the more somber moments with a hope 
for better things to come. If I am happy I can be successful, and 
success gives the impetus to do better things. 

Every man on earth is here for some purpose. I believe that I 
can attain that purpose by doing the best I can with everything I do. 
The thing at which I was intended to excel will show itself in the en- 
joyment I derive from doing it. At the present time my greatest 
source of joy is in writing. I cannot say yet that writing is my talent. 
What I write is immature and sometimes I become very discouraged 
when my words do not match my thoughts. But to know that I 
have written a line full of life and color gives me more pleasure than 
any other thing I do. 

I am not concerned with the life to come after this one. I do not 
pretend to know what it holds for me or how I can change it. I am 
concerned with the present. It is of too much importance to be con- 
sidered as a doormat to the house of the next life. The present is a 
house in itself and it must be lived in. I believe that if this life were 
not meant to be enjoyed it would not hold so many sources of joy. 

It is not my ambition to be remembered for generations to come 
because of some great accomplishment. If I succeed in doing one 
thing really well, I shall be satisfied. It was never intended that I
be a famous person, but I am going to see to it that I am a successful 
one. 


Inventory Scores 
“Score” Goal 


18 Getting as many deep and lasting pleasures out of life as I can. 

17 Promoting the most deep and lasting pleasures for the greatest 
number of people. 

17 Self-development—becoming a real, genuine person. 

16 Fine relations with other persons. 

15 Making a place for myself in the world; getting ahead. 

14 Handling the specific problems of life as they arise. 

13 Peace of mind, contentment, stillness of spirit.

11 Power; control over people and things. 
















Serving the community of which I am a part. 

Self-sacrifice for the sake of a better world. 

Living for the pleasure of the moment. 

Serving God; doing God’s will. 

Achieving personal immortality in heaven. 

Self-discipline—overcoming my irrational emotions and sen- 
suous desires. 

Doing my duty. 

Survival, continued existence. 

Being able to “take it”; brave and uncomplaining acceptance 
of what circumstances bring. 

Finding my place in life and accepting it. 

Realizing that I cannot change the bad features of the world 
and doing the best I can for myself and those dear to me. 

1 Security—protecting my way of life against adverse changes. 




In her essay, this girl first mentions, as an aim in life, the 
desire to be happy. This desire is reflected in the goal which 
heads her list in the inventory, “Getting as many deep and 
lasting pleasures out of life as I can.” (Of the statements in 
the inventory this is the expression nearest to “happiness.”)
The rank she gives to “Peace of mind” indicates the same atti- 
tude. The social aspect of this same aim appears in two other 
goals she ranks high: “Promoting the most deep and lasting 
pleasures for the greatest number of people” and “Fine rela- 
tions with other persons.” 

The second aim emphasized in the essay is the desire to be 
successful, to do the best she can in everything. In the inven- 
tory this aim is indicated by the high ranking of the goals of 
“Self-development” and “Making a place for one’s self in the 
world.” 

In her essay, this student stresses the present and is specifi- 
cally indifferent only to the problem of “the life to come after 
this one.” Hence the goal of “Handling the specific problems 
of life as they arise” ranks high in her list and that of “personal 
immortality” falls in the lower middle. 

At the end of her list come, as might be expected in the case 
of one who wishes to make a place for herself in the world, the 
more passive goals of “Finding my place in life and cheerfully 
accepting it” and merely “Being able to take it.”⁸


8 Taking Steps III and IV, one should ask whether this philosophy, as expressed,
is adequate, and if not, how the institution can assist the student in securing a better 
point of view. Although considerable work has been done along this line, limita- 
tions of space prevent the inclusion of any account of it in the present article. 













Possibly these brief comments will suffice to show the more 
important relations between the essay and the inventory, 
though practice gained through interpreting a number of scores 
makes the results more meaningful to the user. Some differ- 
ences between essay and inventory may also be noted. For ex- 
ample, the interest in writing, stressed in the essay, does not of 
course appear in this inventory. On the other hand, the inven- 
tory reveals the student’s point of view on a number of issues 
which she did not mention in her essay. Unfortunately a sin- 
gle example cannot illustrate the many varieties of viewpoint 
and the striking contrasts between them which can be indicated 
by the inventory, nor can it show the analyses and comparisons 
possible for the scores of groups of students.⁹

It is still too early to make a final evaluation of the inven- 
tory. Data are still coming in, and studies of the validity and 
other aspects of the instrument are still being conducted by the 
staff and by the cooperating institutions. Nonetheless it is 
fair to say on the basis of the evidence now available that the 
inventory has proved extremely useful and seems a reliable in- 
strument for the purpose for which it was intended. 


9 Studies of several groups of students and adults can be found in Staff News 
Letter, vol. IV, no. 11. 

















MAJOR STRATEGY VERSUS MINOR TACTICS IN 
MERIT ADMINISTRATION 


FRED S. BEERS and CECIL R. BROLYER
Social Security Board 


Civil service or merit systems are relatively new to the
American scene of government. Even the most venerable of 
them are only two generations old; but between 1935 and 1944 
the number skyrocketed until there were state-wide systems or 
ones that included a combination of departments in all but 
three states. Stimulation for this phenomenal growth came in 
part from federal legislation, particularly the Social Security 
Act, which in 1939 was amended so that selection of personnel 
on a merit basis was one of the requirements for receiving grants 
in public assistance and unemployment insurance. 

Passing laws is risky business—witness the prohibition 
amendment. Civil-service legislation has sometimes been 
passed in nearly as precipitous a fashion. Although the re- 
sults have been notably better, they have left little doubt that 
statutes alone, indispensable as they are, cannot make a merit 
system. Practice must keep step with legal precept if public 
acceptance is to be sufficiently durable to make laws effective. 
Philosophically inclined members of the legal profession say 
that 80% of the people must be doing voluntarily what a law 
dictates if the 20% who need coercion are to be brought into 
line. 

For more than a century the American tradition of per- 
sonnel selection was opportunistic. Officeholders were dis- 
possessed regularly with each change in administration. Hav- 
ing an average expectation of tenure amounting to two years, 
they spent the first year learning what they were supposed to do 
and the second helping to re-elect the chief administrator. 
This practice was a main source of popular merriment. The 


cost was enormous, but Americans have always paid hand- 
somely for their entertainment. In the field of business and 
industry, competitive advantage, often cut-throat in nature, 
took precedence over other considerations. Extended to per- 
sons, the principle found expression in hiring, firing, and fierce 
competition for jobs. 

Even in education the pressures against reasonable selection 
were legion, and there had been marked lag in the adjustment 
of opportunity to individual and social needs. For example, 
before the College Entrance Examination Board was founded 
in 1900 the colleges had been “evaluating” secondary schooling 
by means of records not even superficially comparable. When 
they had muddled through to elements of a rational policy for 
admission, some presidents still objected. One, more naive or 
pragmatic than the others, “pointed out that a college might 
wish to show special favor to sons of large benefactors, to sons 
of trustees, or to sons of public men of importance who presumably
would have difficulty in meeting the announced admissions
requirements.”¹

Eventually a system of personnel selection may develop 
that can with justice be called scientific. Those who have 
studied the problem know that on theoretical grounds an in- 
finite number of such systems exist as possibilities. Two 
questions would have to be answered if a particular system were 
chosen for practical application: Which of the possible systems 
would be most useful? And who would select it? According to
democratic ideology, the citizens would decide and choose. It 
would be useful to the extent that its results were sensed as 
beneficial to the entire society; it would be selected by the 
entire society. If the urge for scientific selection were com- 
pelling, we would now be searching for our undefined or basic 
terms; we would be exploring the postulates smallest in number 
but useful to attaining our aims. With these goals reached, our 
conclusions and their application would follow as inevitably as 
night follows day. But generations may come and go before 
public opinion can forge a method of selection worthy to be 
called scientific. 


1 The Work of the College Entrance Examination Board, 1901-1925. New
York: Ginn & Co., 1926. 

























Meanwhile we must consider available alternatives. Among 
these the merit system seems best, although its imperfect tech- 
niques can be much improved as this article will attempt to 
point out. 

In their present stage of development merit systems cannot 
pretend to science. At best they represent a compromise be- 
tween the rational and the instinctive, drawing as they can on 
elements of the scientific that lie snugly adjacent. Internally 
their promise of development lies in a frank recognition that 
theirs is an art; that, as in clinical medicine, their art is un- 
worthy of the name if any possible conjunction with science is 
overlooked or set aside. Externally their hope lies in shaping 
public opinion by the unswerving integrity of their practice and 
the truth and cogency of their precepts. The basic principles 
should be and are self-evident. They should both reflect and 
illuminate the democratic ideal. 

Five basic principles for a merit system of public administration
may be said to have reached the stage of general acceptance.
In a democracy they are axiomatic: (1) open competition
as the method of choosing public employees, (2) selection
by practical and scientific methods of the most competent per- 
sonnel available from open competition, (3) equal pay for equal 
work, (4) a career service conditioned upon meritorious per- 
formance of work, and (5) the right of appeal from any or all 
personnel actions. 

It is unfortunate but not less than human that disagree- 
ments over the minutiae of carrying out a principle sometimes 
become so noisily vociferous that the still, small voice of the 
principle is lost in the general clamor. Mere contentiousness, 
however, is a minor ill. More critical distractions from the 
principles arise from legitimate differences of opinion and the 
swift crosscurrents of deadlines to meet, techniques to refine, 
inefficiencies to scotch, rituals to perform, red tape to unwind, 
irate applicants to soothe, and special privilege seekers to 
confront. 

In the hurly-burly it is sometimes forgotten that merit sys- 
tems are never ends in themselves. They exist to serve the 
operating agencies. But merit systems should be given ad- 













ministrative strength equal to that of the operating programs 
and should have their unqualified support. Lacking this, they 
may in time degenerate into irresponsible and malodorous 
adjuncts of the spoils system. The relationship between the 
operating agency and the merit system is reciprocal. As the 
one prospers, so does the other. If the one sinks into respect- 
able somnambulism, the other is desolate also. 

Some of the complexities involved in applying basic merit 
principles will be apparent if we examine each principle sepa- 
rately: (1) The principle of open competition as a generaliza- 
tion is especially alluring in a democracy. When H. L. Men- 
cken in his salad days translated the Declaration of Independ- 
ence from sonorous Johnsonese into American, he voiced the 
underlying principle in language dear to the heart—“Every- 
body is as good as everybody else and maybe a damn sight 
better.” 

In greater or lesser contradiction of the principle, minimum 
qualifications of education or experience or both are often set up 
as obstacles to competition. The rationalization is that these 
hurdles constitute protection against the cost and waste of fail- 
ing large numbers of applicants, experience having shown that 
most people with 4th grade education and wheelbarrow work- 
history flunk out on examinations for white-collar jobs. Per- 
sons of great natural endowment are, of course, less dependent 
upon formal education than others. Americans like to think, 
however, that Lincoln, the Wright brothers, and Edison are 
typical rather than exceptional. So-called minimum qualifi- 
cations may sometimes be subordinated to the interests of 
special vocational or professional groups; if severe restrictions 
are imposed, they may become subversive of the principle of 
competition they were designed to facilitate. The danger is 
accentuated by the fact that amount rather than quality of 
experience is credited. (Comparable data on quality of experience
are almost unattainable.)

Local residence is almost completely indefensible as a hurdle 
to competition. Yet it is widely accepted, perhaps because it 
expresses the yearning of the community to regard itself and its 
members as better than... . By way of illustration it is rumored 










that every faithful citizen of Boston hopes some day to move to 
New York and not to like it! 

Preference for special groups limits the application of open 
competition. Lady Bountiful scattering sunshine by her pres- 
ence and gifts from her well-stocked larder is a national idol. 
Except in poetry we seldom think that “Earth gets its price for 
what Earth gives us.” For example, recent polls show exten- 
sive public support for continuing the tradition of preference 
in the public service for veterans, their wives, and kin. What 
would be the effect on public psychology if proposed legislation 
read about as follows: If you are not a veteran, you will have 10 
points subtracted from your competitive score on whatever 
examination you take for a position in the public service; if you 
are a veteran but not wounded, you will have 5 points subtracted?
It is not only in our efforts to befriend veterans that
we fall into the illogical quicksand of something for nothing. 
We do so elsewhere. All of us are appalled at the “mounting 
death rate owing to heart disease.” We are brought up short, 
however, when we are asked what death rate we should like to 
see rise if that for heart disease could be reduced. 
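
The arithmetic behind the hypothetical legislation can be checked directly. The sketch below assumes the additive form of preference (10 points added for a wounded veteran, 5 for any other veteran; these figures simply mirror the subtractions quoted above and are an assumption here) and compares it with the subtractive wording. The candidate names and raw scores are invented for illustration.

```python
# Illustrative only: candidates and raw scores are hypothetical.
candidates = [
    ("Adams", 88.0, "nonveteran"),
    ("Baker", 81.0, "veteran"),
    ("Clark", 79.0, "wounded_veteran"),
    ("Davis", 92.0, "nonveteran"),
]

# Assumed additive form of preference: points added for veterans.
BONUS = {"nonveteran": 0.0, "veteran": 5.0, "wounded_veteran": 10.0}
# Subtractive wording from the text: points taken from everyone else.
PENALTY = {"nonveteran": 10.0, "veteran": 5.0, "wounded_veteran": 0.0}


def rank(adjust):
    """Return candidate names ordered from highest to lowest adjusted score."""
    adjusted = [(raw + adjust(status), name) for name, raw, status in candidates]
    return [name for _, name in sorted(adjusted, reverse=True)]


ranking_with_bonus = rank(lambda status: BONUS[status])
ranking_with_penalty = rank(lambda status: -PENALTY[status])
ranking_raw = rank(lambda status: 0.0)
```

Under these assumptions the bonus and penalty orderings coincide, because every adjusted score differs between the two schemes by the same constant of 10 points; only the ordering on raw scores differs from them.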

(2) Selection of the best available as a result of competition 
commends itself to the rationally minded. Most commonly 
the process consists of a written examination, a rating of educa- 
tion and experience, and an oral interview. Resulting scores 
are combined by differential weighting, each weight presumably 
determined from the relative appropriateness of the part scores 
to the requirements of the job. At the risk of over-simplifica- 
tion, we may regard the written examination as a check on the 
powers of thinking and the extent of appropriate knowledge; 
the rating of experience, a check on stability—the staying pow- 
ers of the candidate; and the oral interview as a check on his 
histrionic abilities—how good he is as an actor. 
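
The combining of part scores by differential weighting can be sketched as follows. This is a minimal illustration, not a description of any actual merit system: the weights, the 0-100 scale, and the applicant scores are all hypothetical.

```python
# Hypothetical weights for one job; in practice each weight would be set
# from the relevance of the part score to the requirements of that job.
WEIGHTS = {"written": 0.5, "education_experience": 0.3, "oral": 0.2}


def composite(part_scores, weights=WEIGHTS):
    """Weighted mean of part scores that share a common scale (e.g. 0-100)."""
    total_weight = sum(weights.values())
    return sum(weights[part] * part_scores[part] for part in weights) / total_weight


def register(applicants, weights=WEIGHTS):
    """Arrange applicants in order of composite score, best first."""
    return sorted(
        applicants,
        key=lambda name: composite(applicants[name], weights),
        reverse=True,
    )


# Hypothetical applicants and part scores.
applicants = {
    "Applicant A": {"written": 90, "education_experience": 60, "oral": 70},
    "Applicant B": {"written": 70, "education_experience": 90, "oral": 80},
}
order = register(applicants)
```

With these invented figures the applicant who is weaker on the written test still heads the register, which shows how strongly the final order depends on the weights chosen.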

It is in the area of the written examination that personnel 
administration today comes most closely into conjunction with 
science. Full advantage should be taken, but often is not, of 
this fact. Scientific measurement has come about within the 
memory of men still living. Between the early attempts of 
Galton and Cattell to apply to human traits measures analogous
to those used in physical science and the appearance approxi- 
mately two decades later of the Army Alpha lies a period of 
significant advances. Since that time, in schools, colleges, and 
in our armed forces this science has come of age, both in theory 
and in application. It has been expanded to include special 
skills and aptitudes, attitudes and interests, and is being ex- 
tended to include relatively unique differential traits. In merit 
administration the advantages this science offers in “arranging
candidates in order” are recognized.

Rating education and experience has been an armchair exer- 
cise. About all that is reliably measured is amount of time 
spent in school or at work; the quality of the experience and its 
effect upon the person are ignored. True, those who have gone 
to school for a long time or have held responsible positions are, 
if a random group, more likely to succeed in similar circum- 
stances than a corresponding group of the same age who have 
not. But extensive studies by competent observers have 
shown, for example, that there is a college for every level of 
ability in almost every State, and that variability of achieve- 
ment within colleges is marked. But neither of these facts is 
considered in evaluating education. Although the studies of 
work experience have not been so extensive nor perhaps so care- 
fully made, little strain is put on the imagination in recogniz- 
ing that evaluating work-experience and schooling are equally 
complex. What is true of the one would more than likely be 
true of the other. 

That too much reliance can easily be placed upon work- 
experience is made clear by three commonly practiced adminis- 
trative devices plus a factor that we shall call “survival.” 
When an administrator has on his staff a difficult or inept per- 
son—but not so bad that dismissal is, under the rules of tenure, 
relatively simple—he watches his opportunity to “unload” that 
employee on someone else. The inept one moves on to his new 
position happily unaware of what has happened to him. And 
he may be similarly unloaded several times, carrying with him 
as he moves crabwise a record of lengthening and satisfactory 
experience. If he were given an unsatisfactory record, he could 
not be unloaded. The second administrative practice is familiarly
known as “kicking upstairs.” For equally good, or bad,
reasons the unwanted individual is eliminated through promo- 
tion or reassignment within the organization but at a higher 
level. The third practice is “holding down.” That is, an em- 
ployee may be so valuable in a low position that the administra- 
tion cannot afford to advance him. In any event the record 
of the worker, as seen by an objective observer, masks at least 
a part of the truth. 

As for the factor of “survival,” it is among the commonest 
of influences contributing to unreliable rating of work-experi- 
ence. In every organization people tend to get bored. Those 
most ambitious, energetic, and competent will find opportu- 
nities for advancement more readily than others. If top man- 
agement by too little or poor leadership contributes to this 
centrifugal force, the better people rather than the weaker ones 
will tend to find elsewhere jobs more suitable to their talents. 
These vacated jobs are likely to be filled from among those who 
are left. Eventually persons having long, solid, C+ experience
predominate. This is the factor of survival that over a period 
of years contributes to high ratings on experience for mediocre 
individuals and, incidentally, may staff the top jobs with in- 
competents. 

Oral interviews could be a very useful part of the selection 
process. Although they are extensively used for positions re- 
quiring contacts with other workers and with the public, their 
value is sometimes questionable. Part of the difficulty results 
from failure of examiners to recognize the limitations of the 
interview and to refine techniques that will point up its virtues. 
Essentially the interview should be a check on the candidate’s 
powers of acting. That everyone is an actor will be imme- 
diately apparent if we reflect for a moment on what happens 
during every conversation. We keep the lid on those things 
that would give offense and try to express only those that will 
please or that will accomplish our ends with the least amount 
of friction. When the “oral” is used as the casting director 
employs the tryout, we shall have begun to refine it for more 
effective use. The casting director asks the people who come 
before him to “register” this or “register” that. His applicants,
who aim to please, respond accordingly; and the director then 
makes a selection from among them according to the requirements
of the part that is to be played. “The part” and “the
requirements of the part” are the terms that need
definition in any development of devices to arrange competitors 
in relative order of skill. 

To sum up, selection normally consists first of a written test. 
Such a test can and should be carefully prepared since a science 
for doing this is available. It is the best method of determining 
the would-be public servant’s knowledge and his ability to use 
that knowledge. The rating of training and experience is a 
second factor contributing to the selection process. Has the 
applicant been through the conventions of society, both educational
and social, that are thought to be prerequisites to competent
performance on the job? Sole reliance on these conventions,
prerequisites, or minimum requirements implies a
faith in two other propositions patently false, namely, that all 
individuals have equal endowment and that all respond equally 
to experience. If these last two propositions are not true, then 
there is always a real probability that someone can be found 
who could do a competent job although lacking minimum 
qualifications. Nevertheless, the establishment of some con- 
ventions is administratively desirable and even essential. But 
their promulgation by fiat should be made in a state of self- 
awareness rather than in a state of self-delusion. These con- 
ventions simply apply two questions to the competitor: Is he 
reasonably well equipped for the position for which he is com- 
peting? Does his experience indicate dependability? Because 
of limitations in evaluation, in the individual, and in society, 
such ratings are purely judgmental today; that is, they are still 
unscientific. Finally, the oral interview can and does con- 
tribute to the selection process. Its purpose needs clarifying 
and its techniques refining. The three in optimal combination 
yield better results than any one alone. 

(3) Equal pay for equal work is a necessary complement to 
open competition and selection of the best qualified if a merit 
system is to be productive of an effective working spirit. Few 
forces are more disruptive of staff morale than inequities in pay
and rank. From the principle of just recognition for work done
stems the durable structure of an organization, the position 
classification and pay plan. In principle equality and fairness 
are axiomatic. They need no defense. But they are as difficult
and elusive to apply as they are self-evident. There is as
yet no science of organization; no clear dependence of pay 
scales on ineluctable fact. Even more distressing is the ap- 
parent ease with which known elements of human psychology 
and close-to-earth experience can be ignored. 

If keen observation and fidelity to meaning of facts can be 
called psychology, the novelists are often to be classed among 
the best psychologists. As Charlotte Bronte wrote in the 
“Editor’s Preface to the New Edition of Wuthering Heights”: 


The writer who possesses the creative gift owns something of 
which he is not always master—something that, at times, strangely 
wills and works for itself. He may lay down rules and devise princi- 
ples, and to rules and principles it will perhaps for years lie in sub- 
jection; and then, haply without any warning of revolt, there comes 
a time when it will no longer consent to “harrow the valleys, or be 
bound with a band in the furrow”. . . . When, refusing absolutely to 
make ropes out of sea-sand any longer, it sets to work on statue- 
hewing. . . . Be the work grim or glorious, dread or divine, you have 
little choice left but quiescent adoption. As for you—the nominal 
artist—your share in it has been to work passively under dictates you 
neither delivered nor could question. . . . If the result be attractive, 
the World will praise you, who little deserve praise; if it be repulsive, 
the same World will blame you, who almost as little deserve blame.

On first reading one gets the impression that here is, perhaps,
the best definition of genius he has seen. But closer reflection,
stimulated by the shift in point of view from the
impersonal “he” to the very personal “you,” suggests that 
Charlotte’s description is not of genius at all but of everyone 
who is human. The ordinary or garden variety of classification
and pay plan under a merit system leaves much undone that 
might with a little effort mold such plans closer to the heart’s 
desire. The typical pyramidal concept (administration at the top,
next supervision, then professions and specialized vocations, and
finally the workers) leaves out of account much of the universal
human urge for elbow room and freer play for talent.
Even at the top it is quite possible to search dozens of plans
and not find one that would make provisions elastic enough to 
attract a man like William James (physician, psychologist,
philosopher, man of letters) or Benjamin Franklin (artisan,
scientist, inventor, diplomat, statesman).

Inelasticity, however, is not the only weakness. Of greater 
jeopardy to the principle of fairness, but perhaps more easily 
remedied, is the failure of many operating administrators to see 
the manifold advantages a classification and pay plan can con- 
tribute to good administration. Too often they regard the 
structure as a hampering rather than a facilitating device; by 
circumventing the requirements they put themselves in the 
position of the sick man who brushes aside his physician and 
clings to his amulet. Merit systems must contribute more 
than they do now to informing and instructing the operating 
agencies—administrators and workers alike. 

(4) Fourth among the major principles of merit administra- 
tion is the requirement that the system provide for a career 
service based on merit. Basic to this principle is tenure of 
office for those who have been selected as the best available 
and who have qualified for permanent appointment by having 
served successfully a working-test or probationary period. 
Once they have passed muster it is in the interest of the public 
service that they be given reasonable assurance of holding their 
jobs. With this assurance they are free from fear, and can 
direct their energies to increased job efficiency. Bernard 
Shaw, however, cautions those who start at the bottom and, 
hopefully, climb the proverbial ladder rung by rung. You 
don’t learn even to hold your own, he says, “by standing on 
guard but by attacking and getting well hammered yourself.” 

A policy of promotion-from-within is another element of a 
career service. It is carried to extremes when it is so nearly 
like the human circulatory system that elaborate preparations 
and intravenous injections are necessary for the purpose of in- 
troducing a little new blood. Very often new blood brings new 
life, as many a soldier can testify. “Permanent tenure” is not 
an unmixed blessing. It has even been called a necessary evil. 
Those who accept it as a principle of administration and those 
who profit from it as employees should understand its hazards
and the limitations on its effectiveness. As part of a career 
service, some method of evaluating the worth of those with the 
prospect of promotion is necessary. Recourse here to the basic 
principles of selection is indicated—written promotional ex- 
aminations with a broad base of competition, plus service or 
efficiency ratings as partial determiners of “the best available.” 

Service ratings leave much to be desired. They have inher- 
ent limitations. They put a premium on conventionality and 
often on the mediocrity that never swerves from the beaten 
path. They do not foster understanding of the brilliant em- 
ployee whose contribution this week is more than noteworthy 
but whose unexpected absence the following week leaves the ad- 
ministrator gnashing his teeth. Service ratings do, however, 
bring out into the relative open, so that impartial observers can 
see them, the supervisory judgments that would otherwise be 
made behind closed doors and acted upon arbitrarily in star 
chambers. Moreover, the rating process can be refined. If 
employees share in it and understand its advantages as well as 
its limitations, it can serve an exceedingly useful though limited 
function. 

(5) The last of the five basic principles is the right of appeal. 
This right rests on the theory that the individual under merit 
administration has basic rights that cannot be ignored and that 
the administrator has responsibilities that he must discharge. 
Appeals usually lie when an applicant has been refused the 
privilege of standing an examination; or if he feels that he has 
been unfairly rated on one or more parts of the selection 
process; or if he challenges the fact that he was not certified 
or that his name was improperly deleted from a register. Ap- 
peals frequently lie for layoff, demotion, suspension, and almost 
always for summary dismissal. 

Many defensible differences of opinion exist over particulars 
in every system of appeals: the actions covered, the nature and 
calibre of the appeals body and the manner of appointment, the 
question of informing the worker of his rights, and the question 
of the administrator’s responsibility once an appeal has resulted 
in a finding. But there is little if any disagreement among 
merit or civil service systems that the right of appeal is cardinal.

In this article merit systems have been criticized—not that 
they may be destroyed but that they may be improved. The 
only alternative to selection by merit is selection by personal 
whim. In that direction lies chaos. Thus, society is left with 
the task of making its “merit system” more workable, more 
reliable, more internally consistent—in short, more nearly 
scientific. Enough experience with merit systems is now avail- 
able that some of the malfunctionings are apparent. Knowl- 
edge is available that could be used to correct these malfunc- 
tionings. Improvement of merit systems is contingent upon 
their scientific development; their acceptance is dependent 
upon a greater public awareness of what merit systems are; 
their perfection waits upon a demand that they be what they 
could be. 

In The Faith That Heals, Sir William Osler, noblest physi- 
cian of his day, says, “Nothing in life is more wonderful than 
faith—the one great moving force which we can neither weigh 
in the balance nor test in the crucible.” Through its faith in 
democracy this nation is philosophically committed to the 
merit principle. Imperfections are transitory. The “sub- 
stance of things hoped for” is scientific selection of the best 
qualified public servants.

HOW TEACHERS CAN IMPROVE THEIR TESTS¹


MAX D. ENGELHART²
Chicago City Junior Colleges 


The chief function of a teacher is that of directing and motivating
pupils toward the attainment of desirable educational
objectives. In the performance of this function testing can 
play a very important part. When objectives are adequately 
defined and tests devised which are valid with respect to them, 
the extent to which objectives are being attained can be mea- 
sured. Furthermore, such tests define the objectives for the 
pupils and motivate the pupils toward them. When the results 
of such tests are adequately analyzed and interpreted, the 
teacher obtains a means of better orienting instruction and the 
pupils secure motivation through knowledge of progress. 

Construction of exercises and analysis of the data resulting 
from their use can make objectives more definite and more 
meaningful to the teacher. The creation of novel exercises 
and the analysis of data pertaining to them may widen the 
scope of objectives recognized and ultimately realized in in- 
struction. 

Instructional objectives are most usefully defined in terms 
of observable behavior. Each specific objective should be an 
answer to the question “What should the pupils be able to do 
as a result of instruction?” Instruction which produces the
abilities to do certain things should concomitantly develop the
attitudes, interests, and ideals which motivate their doing. 
Instead of the general and intangible objectives “good citizen- 
ship,” “appreciation of good literature,” and “scientific 
method,” specific objectives formulated in terms of observable
behavior may include: “Presenting arguments in support of the
elimination of the general property tax based on factual evidence
critically analyzed and evaluated,” “selecting a short
story for leisure reading on the basis of the following criteria. . . ,”
“rejecting a conclusion which goes beyond the data.”

¹ Reprinted by permission of the Chicago Schools Journal.
² On leave as a member of the Examinations Staff of the United States Armed
Forces Institute.

Many instructional objectives must be concerned with fac- 
tual information, or knowledge. In the selection of facts con- 
sideration should always be given to the contribution such 
knowledge can make to the types of behavior illustrated in the 
preceding paragraph. The thinking necessary for adequate 
performance of activities recognized as the really worth-while 
objectives of instruction must be based on knowledge. Information
which makes no recognizable contribution to such
thinking, or to the further learning which may contribute to 
such thinking, is not worth teaching. It is also not worth 
testing. 

While teachers usually contend that their objectives are 
not restricted to the memorization of miscellaneous, unrelated, 
and often trivial information, the tests used by most teachers 
are convincing evidence that their actual objectives are thus 
restricted. When objective tests are made by teachers the 
exercises are most often of the true-false or multiple-answer 
types and the content of these exercises is wholly based upon 
information given in the text. When essay tests are con- 
structed and used, the questions are factual in character and 
are scored only for the facts remembered. 

Tests should be designed which measure knowledge of facts. 
True-false or multiple-answer exercises can be efficient means 
of measuring such knowledge. The possibility of representative
sampling of pupil knowledge is one of the advantages of
objective testing. In writing such exercises the teacher should 
be critical in the selection of the facts to be tested. One useful 
criterion in the selection of facts is the degree of relevance of the 
fact to some important general concept or principle which can 
be applied in the solution of some problem involving reflective 
thought. Consideration should be given to whether or not 
knowledge of the specific fact contributes to further learning. 
In many fields progress in learning is contingent upon the synthesis
in the mind of the pupil of an ever growing body of
factual knowledge. In thinking about an important con- 
temporary social problem the pupil may require a knowledge 
of numerous historical facts relevant to the trend which has 
created the problem. Series of factual objective exercises may 
be useful in determining the extent to which such knowledge 
has been attained. 

In writing true-false exercises certain precautions should be 
observed. Broad generalizations should usually be avoided. 
Such words as “always,” “never,” “none,” “only,” “all,” and 
“every” are obvious clues to the falsity of certain statements 
when the pupil can readily think of exceptions. On the other 
hand, the statement “all echinoderms live in salt water” rep- 
resents a difficult true-false item to one whose biological knowl- 
edge is not extensive. In multiple-answer exercises the incor- 
rect completions should not be too obviously incorrect. Each 
completion should be plausible. It is frequently effective to 
write multiple-answer exercises in which a “best” rather than 
a “correct” answer is called for. Such exercises tend to accom- 
plish more than the measurement of memorized information. 
All, or most, of the completions may be “correct”; the pupil 
must judge which completion is the “best.” In a recent social 
science comprehensive examination the directions for one series 
of multiple-answer exercises asked the student to identify the 
alternative suggesting the “most significant relationship” be- 
tween the two things mentioned. The first exercise in this 
series is given below: 

1. Minor party—social and economic reform: (A. Minor parties 
are usually characterized by radical platforms; B. minor parties seldom
win an election; C. reforms agitated by minor parties are sometimes
adopted and enacted into law by the major parties; D. unsuccessful
minor parties ultimately pass out of existence; E. the major
parties have the advantage in organization, funds, and prestige and,
hence, are more successful in promoting reforms.)


While most of the answers are “correct,” alternative “C” 
represents the most significant alternative. In a recent biolog- 
ical science comprehensive examination each exercise began 
with a statement to be explained. More than one of the alter- 
natives were “correct,” but only one was accepted as the 
“explanation.” The following exercise is an example:

Statement: In embryological development certain structures de- 
velop in a manner analogous to the conversion of a passenger boat 
into an aircraft carrier. (A. The morula changes into a blastula 
which changes into a gastrula; B. part of the placenta is formed from 
the wall of the uterus of the mother; C. the mesoderm gives rise to 
the heart, blood vessels, blood cells, lymphatics, kidneys, and certain 
other organs and structures; D. in the higher vertebrates, the middle 
ear cavity, the eustachian tube, the thymus gland, and the parathy- 
roid glands develop from gill slits or arches; E. the ectoderm gives 
rise to the epidermis and to the lining of the mouth and anal apertures.)

The answer “D” is the alternative which best explains the 
statement. This example illustrates another very important 
matter. If the teacher desires to measure how well pupils can 
handle questions involving thought rather than memory, an 
exercise must constitute a novel problem. It is improbable 
that the students were taught the particular analogous relation- 
ship implied in the above exercise. If they were taught this 
analogy, the exercise will measure only the extent to which the 
analogy is remembered. If, however, they were merely taught 
the facts represented by alternative “D” then the relating of 
this alternative to the introductory statement becomes an act 
of thought transcending the utilization of memory. The ex- 
amples just given also illustrate the fact that exercises which 
are simple in form can be effective in the measurement of be- 
havior involving thinking. It is not necessary to construct 
exercises involving complex directions in order to obtain such 
measurement. The important thing is that the content of the 
exercise should be to some extent novel to the pupils. 


Discriminative Thinking 


One means of securing the novelty referred to in the pre- 
ceding paragraph is to bring together in a series of objective 
items numerous facts whose classification in certain ways will 
involve the ability of the pupil to do discriminative thinking 
and to synthesize his knowledge. Let us suppose that pupils 
in physics have been taught certain facts pertaining to sound 
and, at a later date, certain facts pertaining to light. Let us 
suppose further that the teacher has not stressed the similarities 
and differences of sound and light phenomena. The good
teacher would probably stress these things, but for the sake 
of our illustration, let us suppose that this is to be postponed 
until the test has been given. If this has been the case then 
the following series of exercises should involve discriminative 
thinking and, as this thinking is taking place, require a synthe- 
sizing of knowledge by the pupil: 


On the line preceding each of the following items, write the letter
A if the item is true of sound
B if the item is true of light
C if the item is true of both
.... Its velocity in water is greater than in air.
.... It can be reflected.
.... It can travel through a vacuum.
.... It can be refracted.
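
One way to score a key-list exercise of this kind can be sketched as follows. The article prints no answer key, so the key below is inferred from elementary physics and should be read as an assumption, as should the lettering (A for sound, B for light, C for both) and the sample pupil.

```python
# Answer key inferred from elementary physics; the article does not print one.
# Assumed lettering: A = true of sound, B = true of light, C = true of both.
KEY = {
    "Its velocity in water is greater than in air.": "A",
    "It can be reflected.": "C",
    "It can travel through a vacuum.": "B",
    "It can be refracted.": "C",
}


def score(responses, key=KEY):
    """Count the items on which a pupil's letter matches the key."""
    return sum(1 for item, letter in key.items() if responses.get(item) == letter)


def misclassified(responses, key=KEY):
    """List (item, given, expected) for each item answered incorrectly."""
    return [
        (item, responses.get(item), letter)
        for item, letter in key.items()
        if responses.get(item) != letter
    ]


# A hypothetical pupil who treats reflection as a property of light only.
pupil = dict(KEY)
pupil["It can be reflected."] = "B"
```

Listing the misclassified items, rather than reporting only a total, is a design choice made here so that the teacher can see which similarities and differences were not synthesized.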


A number of examples of this very useful general type of 
exercise are given below from various fields. In each case only 
a few of the items are listed. 


In each situation below, an individual or a group of individuals is
seeking protection or assistance. On the line preceding each of the
following items, write the letter
A if the Federal government is responsible
B if the state government is responsible
C if both governments are responsible
D if neither government is responsible
.... The New York Life Insurance Company wishes to open up
     agencies and sell insurance in Oregon.
.... Mr. Jones receives in payment $1,000 in bills which he presently
     learns are counterfeit.
.... A Chicago visitor from Fort Wayne, Indiana, suffers severe injuries
     when his car is wrecked because of defective pavement on
     76th Street.
.... Convicts escaped from Joliet Penitentiary arrive in Des Moines,
     Iowa, hold up a bank, and are seized and held by the local
     authorities.
.... Etc.

On the line preceding each of the following items, write the letter
A if the sentence is fragmentary
B if the sentence contains a comma fault
C if the sentence contains a dangling modifier
D if the sentence exemplifies lack of parallelism or faulty parallelism
E if the sentence is correct
.... One day I feel as though I could lick the world, the next day I
     feel like a swatted fly.
.... Upon returning from the store, my homework requires my
     attention.
.... As we walked along the hall, where a large photograph of Roosevelt
     hung.
.... All of the boys being gone, most of the manual labor was done by
     the girls.
.... Bemoaning her stupid lot, the army and navy claimed all of her
     friends.
.... A freshman learns to study with regularity, to play with enthusiasm,
     and co-operation.
.... Etc.


On the line preceding each of the following items, write the letter
A if the statement is true of the Lycidas
B if the statement is true of the Adonais
C if the statement is true of both works
D if the statement is true of neither work
.... The poem was written in memory of a dead friend.
.... In form as well as content the poem displays a deep personal grief.
.... The poet ornaments his verse with many classical allusions.
Etc.


The following type of exercise can be used in a variety of 
school subjects and is effective in measuring how students can 
make comparisons: 

On the line preceding each of the following paired items write the
letter
A if the item at the left of the page is of greater magnitude than the
  item at the right
B if the item at the right of the page is of greater magnitude than the
  item at the left
C if the two items are of equal magnitude

...  Amount of glucose in blood          Amount of glucose in the
     entering the liver 4 hours after    blood leaving the liver 4 hours
     a meal is eaten.                    after a meal is eaten.

...  Amount of absorption of foods       Amount of absorption of foods
     by stomach.                         by intestine.

...  Percentage of urea in blood         Percentage of urea in blood
     entering the liver.                 leaving the liver.













...  Amount of heat produced in          Amount of heat retained in
     the body.                           the body.

     Etc.


The same form needs only slight adaptation to be useful in 
writing chronology exercises in history. 


On the line preceding each of the following paired items write the 
letter 
A if the event in Column I occurred before the event in Column II 
B if the event in Column II occurred before the event in Column I 
C if the events occurred at approximately the same time (within 
about a year of each other) 


Column I                         Column II
Clayton Antitrust Act .......... Sherman Antitrust Act 
Alabama Claims Case ........... Venezuelan Arbitration 
Dred Scott Decision ............ Fugitive Slave Act 
“Wigwam Convention” ......... Fort Sumter fired upon 


Etc. 


It should be mentioned that there should be some relation- 
ship between the paired events significant enough to warrant 
their being paired. Care should be exercised when writing the 
“C” items to give events that occurred simultaneously, or very 
nearly simultaneously. Note the qualifying remark with re- 
spect to category “C” in the directions stated above. 

Exercises of similar form are useful in measuring the way in 
which pupils handle correlated or cause and effect relationships. 
In writing such items where the relationship is definitely cause 
and effect, the cause should be given first. 

On the line preceding each of the following paired items write
the letter
A if increase in one of the things referred to is usually accompanied
  by increase in the other
B if increase in one of the things referred to is usually accompanied
  by decrease in the other
C if one of the things referred to tends to remain the same when the
  other increases or decreases

. Amount of carbonates dissolved in the water of a river. Number 
of clams in the river. 


. Temperature of the environment of a bird or mammal. Body 
temperature of the bird or mammal. 
















. Amount of dissolved salt (sodium chloride) in a given body of 
water. Number of amphibia in the water. 


. Extent to which men make changes in an area. Rate at which 
the area tends toward balanced equilibrium. 


. Etc.


On the line preceding each of the following paired items, write the
letter
A if significant increase in one of the things mentioned has usually
  been accompanied by significant increase in the other
B if significant increase in one of the things mentioned has usually
  been accompanied by significant decrease in the other
C if one of the things does not tend to change significantly when
  significant change takes place in the other


Dollar diplomacy. Confidence in the United States on the part 
of the South American Nations. 

. Efforts to eliminate the poll tax. Oratory on state rights by 
Southern senators. 

. Efforts to liberalize the Supreme Court. Expressions of con- 
servatism with respect to the issue on the part of the general 
public. 

... Acceptance of the doctrine of checks and balances. Lobbying. 

. Etc.

Intelligent Reading 


The ability to read intelligently in a field is an important 
general objective and more efforts should be made to measure 
how well pupils can perform this activity. This ability is a 
very important factor in further learning within a field and in 
dealing with practical problems when formal school learning 
has terminated. The following exercise illustrates one means 
of measuring such ability. In this example, the material to be 
read is a graph. Such exercises need not be so restricted. 
One may use paragraphs drawn from texts other than those 
studied by the pupils and even from advanced texts if the 
teacher wishes to challenge the pupils. When paragraphs are 
selected they should be self-contained in that the topic is 
treated fairly completely. It is frequently effective to present 
paragraphs which give scientific data and some of the state- 
ments listed may be inferences which may justifiably be derived 
from the data. Others of the statements may be only partially 
justified or may be irrelevant. Some of the items of the following
exercise represent predictions which may not be justified:

[Chart: Total Expenditures, State of Illinois. Horizontal axis: Fiscal Years Ending June 1.]

On the line preceding each statement write the letter
A if the information given in the chart is sufficient for a judgment
  that the statement is definitely true
B if the information given in the chart is sufficient only to indicate
  that the statement is probably true
C if the information given in the chart is sufficient for a judgment
  that the statement is definitely false
D if the information given in the chart is sufficient for a judgment
  that the statement is probably false
E if the information given in the chart is not sufficient to indicate
  any degree of truth or falsity in the statement

... Less money was spent in 1930 than in 1929 for welfare and education.

... In 1931 and 1932 the expenditure of money for highway purposes
    was evidently considered a means of combatting the Depression.

... In 1940 a much greater proportion of the total expenditures was
    for welfare than in 1942.

... Had our country not entered the war in 1941, expenditures for
    welfare in 1942 would have been greater than in 1940.

... The increasing amount of money spent by the State for all purposes
    between 1929 and 1940 must have come largely from taxes
    or from Federal grants-in-aid rather than from borrowing.

... Etc.













Such exercises can be scored in the usual way to obtain a 
total score which shows how well the students agree with the 
key set by the teacher, or the “expert thinker” in the field. 
Such a total score may answer the questions “To what extent 
does the pupil read data accurately?” and “How well does the 
student recognize valid generalizations drawn from data?” It 
is possible, however, to obtain other scores which reveal certain 
characteristics of pupil thinking. For example, a count of the 
number of “B” items marked “A,” and “D” items marked “C,” 
and “E” items marked with something other than “E” is in- 
dicative of the extent to which the pupil tends to go beyond the 
data. A relatively low score would be indicative of relatively 
greater maturity in thinking with data. Such analysis is one 
of the characteristics of the handling of the test data in the 
Eight Year Study of the Progressive Education Association 
and in the Cooperative Study in General Education of the 
American Council on Education. The following statement ap- 
pears in a recent report of the Board of Examinations of The 
University of Chicago: 

The modern tendency is to construct the items so that the wrong
responses are wrong in a specific way—for example, a definition which 
is too broad, or a definition which is inadequate. When alternative 
answers are so chosen, a regular pattern of incorrect responses is 
established which upon analysis yields much more information about 
the students’ mental habits than did incorrect responses constructed 
by the older methods. By means of this pattern type of analysis, it 
is possible to determine whether students’ errors in reading and inter- 
preting data consist in saying that certain things are in the data 
which in fact are not, or in saying that certain things are not in the 
data which in fact are. It may be found that students are able to 
select statements which agree with the data presented but have difficulty
with statements which disagree with the data presented.
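
The count of responses that go beyond the data, described before the quotation, can be sketched in a few lines. This is only an illustration: the function name, the key, and the sample paper are hypothetical, and treating an omitted response (recorded here as `None`) as not counting is an assumption the article does not address.

```python
def beyond_data_count(key, responses):
    """Count the responses on one paper that go beyond the data.

    key       -- keyed letters (A-E), one per statement
    responses -- the pupil's letters, or None for an omitted statement
    """
    # (keyed letter, pupil's letter) pairs that overstate the evidence:
    # "probably true" marked "definitely true", and
    # "probably false" marked "definitely false".
    overstated = {("B", "A"), ("D", "C")}
    count = 0
    for keyed, marked in zip(key, responses):
        if marked is None:          # assumption: omissions are not counted
            continue
        if (keyed, marked) in overstated:
            count += 1
        elif keyed == "E" and marked != "E":
            # the data are insufficient, yet the pupil made a judgment
            count += 1
    return count


# Hypothetical key and pupil paper, for illustration only.
key = ["C", "B", "A", "E", "E"]
responses = ["C", "A", "A", "D", "E"]
print(beyond_data_count(key, responses))  # 2
```

A relatively low count on a paper would be read, as the text suggests, as a sign of greater maturity in thinking with data.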

It is difficult to write exercises of the types described. Such 
exercises must be written and used, however, if pupils are to 
seek worth-while objectives and if the degree of attainment of 
worth-while objectives is to be measured. Where measurement
is restricted, objectives are also restricted. 

When a teacher has written such exercises it is essential to 
secure careful evaluation by other teachers who have the same 
general objectives. The best evaluating is done when the 










evaluating teacher does not have access to the key, but answers 
the items herself. Comparison of several such evaluations is 
valuable in the rejection of bad items and in the revision of 
others. For example, in a biological science examination the 
students were asked to mark certain items true or false on the 
basis of the “principles of inheritance.” The following item 
was accepted as false by several biological science instructors: 
“Boys tend to resemble the father, while girls tend to resemble 
the mother.” However, one instructor pointed out that in a 
certain important respect boys always resemble their fathers 
and girls always resemble their mothers. In the directions pre- 
ceding the items the following qualifying phrase was added to 
take care of the situation: “excluding primary and secondary 
sex characteristics.” 

The preceding paragraphs have dealt exclusively with ob- 
jective exercises. In any balanced program of testing some es- 
say exercises should be included. In writing such exercises 
“fact” questions should be avoided. Objective exercises can 
test knowledge of facts more efficiently and representatively 
than essay questions. Essay exercises should represent novel 
problematic situations. For example, the following essay ex- 
ercise appeared in a recent physical science comprehensive 
examination: 

In 1492, Christopher Columbus began his voyage of discovery by 
sailing southwest to the Canary Islands which are near the coast of 
Africa and in latitude 28° N. He then continued his voyage by sail- 
ing westward in that part of the Atlantic Ocean between the equator 
and 30° N. On his return trip to Spain, early in 1493, he first sailed 
northeast until he was somewhat more than 30° N. and then sailed 


east to Spain. On the basis of information given in your physical 
science text, explain why Columbus sailed as described above. 


Several blank lines followed this exercise in the test booklet. 
Nothing is said about Columbus in the physical science text, 
but information on the belts of the winds is given which could 
be applied by the student in responding to the question. Co- 
lumbus took advantage of the northeast trades in his voyage 
to the New World and of the prevailing westerlies on his return 
to Spain. 

Essays have been based on selections of quoted material 
and on cartoons. In the field of English composition it is 







effective to have essays based on notes presented for reading 
at the time of the examination, or prior to the time of examina- 
tion. 

Information is needed for the writing of a correct response 
to essay exercises which are thought questions or novel prob- 
lems, but the quality of the response will also depend upon the 
extent to which the student critically analyzes the situation 
and thoughtfully organizes his information in an effort to 
meet it. The scorers should be sensitive to more than the cor- 
rectness of the information presented by the student. Their 
ratings should be based not only on the correctness of the facts, 
but also upon evidences of superior selection, evaluation, and 
organization of the facts presented. Comparison of student 
responses with a scale of responses to the same or to a similar 
question may be found to be an effective procedure. Another 
possibility is the use of directions for scoring in which such 
characteristics as organization and originality are defined and 
illustrated, and suggestions are made with respect to the weights 
to be given to each such characteristic. 


Analysis of Test Scores 


After a test has been given and scored the test data should 
be subjected to analysis. Analysis is essential if the teacher is 
to know the extent to which objectives are being attained. 
One type of analysis has been referred to in the paragraph fol- 
lowing the exercise based on a graph. It is also very desirable 
to determine the per cent of correct response to each exercise 
for the group taking the test. A low per cent of correct re- 
sponse may indicate the need of further instruction. In some 
cases low per cents of correct response may warrant the rejec- 
tion of such items for use in testing subsequent classes, or the 
omission of such subject matter as inherently too difficult. 
The analysis may be extended further than merely determining 
per cents of correct response. One can, with a little labor, de- 
termine how well each exercise correlates with whatever is 
measured by the test as a whole. 

One way to do this type of analysis is to separate the papers 
into two groups. The “upper group” contains all papers above 










the median score of the test as a whole while the “lower group” 
contains all papers with total scores below the median total test 
score. Taking one test paper at a time and opposite the num- 
bers of the exercises on a tally sheet, the teacher tallies for each 
exercise correctly answered. For example, if the first paper 
has correct answers for exercises 1, 2, 5, 7, and so on, the teacher 
makes a tally mark after these numbers on the tally form. 
Another form is similarly prepared for the “lower group.” The 
per cents of correct response for each of the groups are then 
computed. (Samples of 100 papers in each group avoid the 
necessity for such a conversion.) Since there are equal num- 
bers of papers in the upper and lower groups corresponding per 
cents may be averaged to obtain the per cents of correct re- 
sponse for the entire group taking the test. The per cents for 
the upper and lower groups are used in reading the correlation 
coefficient from the abac shown on page 122. 
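
The tallying procedure above can be sketched as follows. This is a minimal illustration, not the author's method: the function name and data layout are hypothetical, and two details are assumptions, namely that papers are split into halves by rank on total score and that the middle paper is set aside when the number of papers is odd (the article does not say how papers at the median are handled).

```python
def group_percents(papers):
    """Per cents of correct response for the upper and lower halves.

    papers -- one list per paper, holding 1 for each exercise answered
              correctly and 0 otherwise (same exercise order throughout)
    Returns (upper_pct, lower_pct, total_pct), one value per exercise.
    """
    if len(papers) < 2:
        raise ValueError("at least two papers are needed to form two groups")

    # Rank papers by total score, highest first.
    ranked = sorted(papers, key=sum, reverse=True)
    half = len(ranked) // 2
    upper = ranked[:half]       # papers above the median
    lower = ranked[-half:]      # papers below the median (middle one dropped if odd)
    n_items = len(papers[0])

    def pct(group):
        # Tally of correct answers per exercise, converted to a per cent.
        return [100.0 * sum(p[i] for p in group) / len(group)
                for i in range(n_items)]

    upper_pct = pct(upper)
    lower_pct = pct(lower)
    # Equal group sizes, so the two per cents may simply be averaged.
    total_pct = [(u + l) / 2 for u, l in zip(upper_pct, lower_pct)]
    return upper_pct, lower_pct, total_pct


# Four hypothetical papers on a three-exercise test.
papers = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
upper_pct, lower_pct, total_pct = group_percents(papers)
print(upper_pct)   # [100.0, 100.0, 50.0]
print(lower_pct)   # [0.0, 50.0, 0.0]
print(total_pct)   # [50.0, 75.0, 25.0]
```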

For example, suppose that 65 per cent of the upper group
answer exercise 17 correctly while only 25 per cent of the lower
group do so. The correlation between success or failure on the
item and the total score on the test is +.60. Such an item
makes a significant contribution to whatever is measured by
the test as a whole. This correlation can be seen in the following
table which need not be constructed for each item, but
which is useful in explaining the above.

                 FAILURE ON ITEM    SUCCESS ON ITEM

ABOVE MEDIAN           35                 65

BELOW MEDIAN           75                 25

r = +.60














Take the per cents 80 and 55 as another example. Here
the intersect does not fall on one of the lines in the chart. One
must interpolate between the lines labeled r = +.40 and r =
+.50. The r in this case is approximately +.43. (One interpolates
along the imaginary line perpendicular to these curves.)
When an exercise drops below +.20 in correlating with the
total test score the inference is that the exercise is not making
a significant contribution to the test. The first time a carefully
constructed test is given it is not unusual to find that one-fourth
or one-third of the exercises drop below this value.

[Figure: Abac for Item-Test Correlation. Axes: percentage of upper group passing the item; percentage of lower group passing the item. After Mosier and McQuitty.]

Certain exercises may yield negative correlations. For example,
suppose that only 25 per cent of the upper group answered
exercise 48 correctly while 65 per cent of the lower group
responded correctly. The correlation is -.60 and this negative
relationship is illustrated by the following table:










                 FAILURE ON ITEM    SUCCESS ON ITEM

ABOVE MEDIAN           75                 25

BELOW MEDIAN           35                 65
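
When the abac is not at hand, the coefficients in the examples above can be approximated by calculation. The sketch below uses the cosine-pi approximation to the tetrachoric correlation for a fourfold table split at the median. That the abac rests on this kind of coefficient is an assumption: the article does not state the formula behind the chart, and the function name is hypothetical. The approximation gives values close to those read from the abac for the three examples in the text.

```python
import math


def item_test_r(upper_pct, lower_pct):
    """Approximate item-test correlation from the per cents of the upper
    and lower groups answering an exercise correctly.

    Uses the cosine-pi approximation to the tetrachoric r, which is an
    assumption about the coefficient underlying the abac.
    """
    a = upper_pct           # above median, success on item
    b = 100 - upper_pct     # above median, failure on item
    c = lower_pct           # below median, success on item
    d = 100 - lower_pct     # below median, failure on item
    if b * c == 0:
        # No upper-group failures or no lower-group successes: the
        # formula would divide by zero. Return +1 unless the table is
        # degenerate (everyone passed or everyone failed), then 0.
        return 1.0 if a * d > 0 else 0.0
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))


# The three examples from the text.
print(round(item_test_r(65, 25), 2))   # close to the +.60 read from the abac
print(round(item_test_r(80, 55), 2))   # close to the interpolated +.43
print(round(item_test_r(25, 65), 2))   # close to -.60

# Flag exercises that fall below the +.20 level mentioned in the text.
print(item_test_r(60, 52) < 0.20)
```

Because the approximation and the chart need not agree to the last digit, a value near the +.20 line is better checked against the abac itself.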














Low or negative correlations most often indicate that an 
exercise is bad. The key may be wrong, the exercise may be 
ambiguously stated, or the exercise may be much too easy or 
much too difficult. Analysis of responses to other alternatives 
can determine a better key when the exercises are of the type 
requiring judgments; for example, the type of exercise illus- 
trated with reference to the reading of the graph of state expen- 
ditures. In some cases a low or negative correlation does not 
necessarily mean that the exercise is a poor one. It is possible 
that the exercise is measuring some trait which is worth while in 
itself, but which is not related to whatever is measured by the 
test as a whole. In every case the teacher should study the 
exercise in relation to the total per cent of correct response and 
the correlation coefficient, and should formulate judgments with 
respect to the merit of the exercise and the degree of attainment 
revealed by the data. 

The type of analysis just described applies to objective ex- 
ercises. No definite procedure can be suggested for essay ex- 
ercises. When the scores on a given exercise range over several 
points, ordinary Pearson product moment coefficients can be 
calculated between the score on the exercise and the score on the 
test. It would seem much more useful, however, to analyze 
essay responses in terms of some classification with respect to 
various types of merit or of limitations. 
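
Where a coefficient for an essay exercise is wanted, the Pearson product-moment correlation mentioned above can be computed directly. The sketch below is a plain illustration of the standard formula; the function name and the scores are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either set of scores has no spread.
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for five students: essay exercise and total test.
essay_scores = [2, 4, 5, 7, 8]
total_scores = [40, 55, 52, 70, 78]
print(round(pearson_r(essay_scores, total_scores), 2))
```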

The discussion in the preceding paragraphs may seem to be 
carrying the “improvement of tests” a bit too far for the class- 
room teacher. It may seem that an inordinate amount of work 










and cerebration is involved. If, however, teaching is as much 
worth doing as the producing of manufactured articles or mate- 
rials where the processes are controlled from stage to stage, 
then the labor involved is more than justified. The teacher 
who is willing to do these things will find that the “improve- 
ment of tests” is not merely that. It is also the improvement 
of teaching. 

















PREDICTION OF COLLEGE SUCCESS BY MEANS OF 
THURSTONE’S PRIMARY ABILITIES TESTS¹


CHARLES H. GOODMAN 
The Pennsylvania State College 


In his 1940 monograph on factor analysis, Wolfle (18) 
states: “Few attempts have been made to put the results of 
factor analysis to practical use. Most of these few have dealt 
with intelligence testing. . . . Some studies, Chein (5) and 
Schneck (9), have indicated that achievement in individual 
courses can sometimes be predicted as well [as], or better, by 
special tests or by specialized subtests than by the total score 
of a general intelligence test. Which tests will best predict 
grades in individual courses is an empirical problem. As a 
start toward answering that problem, Thurstone has prepared 
a battery of sixteen tests, giving perception, number, verbal, 
space, memory, induction, and deduction scores (15). He 
warns (16) that this battery is still in the research stage and is 
not ready for routine use. It has been used by Bernreuter (3) 
and Stalnaker (11), but no data have been published to indi- 
cate how well it answers the administrator’s wish for better 
methods of predicting scholastic achievement.” 

This paper is a report upon the studies conducted at The 
Pennsylvania State College specifically related to the possibilities
of using the Thurstone Primary Abilities Tests for the
purpose of predicting scholastic achievement; it provides, in 
some measure, a partial answer to the statement made by 
Wolfle (18) that there is a need for data to indicate how well 
these tests answer the administrator’s wish for better methods 
of predicting scholastic achievement. 

¹ The writer wishes to acknowledge his thanks and appreciation to Marianne
Hessemer, Virginia Dickey Tredick, Isabella Waddell White, and Fred J. Ball for
their permission to report the findings of their studies related to this paper.




Experiments and Findings 


Bernreuter and Goodman (4) conducted a study in 1939 
to determine how well they could predict the success of fresh- 
men engineers at The Pennsylvania State College by means of 
the Thurstone Primary Mental Abilities Tests. One hundred 
seventy freshmen engineers served as their subjects. Bern- 
reuter and Goodman then obtained correlations for the four 
college courses of Chemistry, Drawing, English Composition, 
Mathematics, and semester point average² with the Primary
Abilities Tests. They then calculated multiple correlations, 
using various combinations of the abilities, with semester point 
average, Chemistry, English Composition, and Mathematics. 
The results are shown in Table 1. 


TABLE 1

Correlations and Multiple Correlations of the Primary Abilities with College
Courses in Engineering

           Semester                            English
Ability    point      Chemistry    Drawing     Composition    Mathematics
           average
P          +.04       +.07         +.00        +.05           +.04
N          +.32       +.27         +.01        +.26           +.27
V          +.33       +.32         +.01        +.44           +.16
S          +.23       +.19         +.11        +.11           +.25
M          +.10       +.04         +.11        +.23           -.05
I          +.34       +.23         +.18        +.21           +.29
D          +.38       +.41         +.15        +.21           +.44
NVSID      +.51       ...          ...         ...            ...
NVID       ...        +.49         ...         ...            ...
NVMID      ...        ...          ...         +.49           ...
NSID       ...        ...          ...         ...            +.49





The highest single correlation obtained was + .44 for reason- 
ing with Mathematics and + .44 for verbal ability with English 
Composition. By using a combination of the number, space, 
induction, and reasoning abilities, and calculating a multiple 
correlation with Mathematics, they were able to obtain a mul- 
tiple correlation of + .49. In the case of English Composition, 
by combining the abilities of number, verbal, memory, induc- 
tion, and reasoning, they obtained a multiple correlation of 


2 Semester point average is obtained by dividing the number of credits earned 
by a student into the total number of grade points earned. While the shortcomings 
of semester point average as a criterion for college success are fully recognized, it 
appears to be the best possible one available at The Pennsylvania State College. 













+.49. In both cases the increase in the size of the correlation 
is slightly more than that obtained using the single abilities of 
reasoning with Mathematics, +.44, and verbal ability with 
English Composition, + .44. The best multiple correlation ob- 
tained, +.51, was with first-semester point average, using the 
abilities of number, verbal, space, induction, and reasoning. A
more detailed report of this work has been published in an 
earlier paper by Bernreuter and Goodman (4). 

During 1940 Ball (1) administered the Thurstone Primary 
Mental Abilities Tests to a group of 147 female freshmen and 
159 male freshmen attending the Liberal Arts School of The 
Pennsylvania State College. He then correlated their test 
scores for each of the seven primary abilities with the college 


TABLE 2

Correlations of Thurstone’s Abilities with First Semester Point Average and Grades
in Nine Liberal Arts College Courses

                           P      N      V      S      M      I      D
Semester point average    .15    .24    .35    .04    .28    .28    .27
Art                       .14    .02    .24    .11    .18    .16    .11
Botany                    .13    .12    .28    [?]    .18    .28    .23
English Composition       .07    .14    .40    .09    .26    .23    .13
French                    .08    [?]    .27   -.10    .19    .02    .01
History                   .13    .14    .36   -.12    .29    .12    .22
Mathematics              -.03    .41    .20   -.10    .16    .22    .35
Physical Science          .14    .20    .24    .06    .18    .26    .31
Political Science         .00    .15    .37   -.02    .32    .31    .27
Zoology                   .14    .23    .28   -.04    .34    .24    .11

[?] marks an entry that is illegible in the source.





courses of Art, Botany, English Composition, French, History, 
Mathematics, Physical Science, Political Science, Zoology, and 
with their first-semester point average. Ball’s findings are 
shown in Table 2. 

It will be seen from Table 2 that the range of correlations 
found by Ball is from —.12 to +.41. The highest correlation 
obtained was + .41 for number ability with Mathematics. The 
next highest correlation, + .40, was that of verbal ability with 
English Composition. Verbal ability correlates more highly 
and positively with each of the courses than does any of the 
other abilities. The highest correlation Ball obtained for any 
of the abilities with Semester Point Average was + .35 with ver- 
bal ability. In a further effort to determine whether he could 













increase his predictive possibilities with the Thurstone tests, 
Ball computed a multiple correlation of the memory, number, 
verbal, induction, and reasoning abilities, using as his criterion 
semester point average. The multiple correlation was + .46. 
Ball’s results show that optimum weights yielded only a slight 
increase in correlation over that obtained for the single factor 
of verbal ability, which correlated +.35 with semester point 
average.

Hessemer (8), in a study similar to Ball’s, attempted to 
determine the predictive possibilities of the Thurstone tests in 
the School of Chemistry and Physics at The Pennsylvania 
State College. Upon administering the Abilities Tests in 1942 
to 147 freshmen students, she correlated their test scores on the 
primary abilities with their first-semester point average and 
with the course of Inorganic Chemistry which each of her sub- 
jects had taken. Inorganic Chemistry at The Pennsylvania 
State College is the first course in chemistry taken by freshmen 
students. It is both intensive and extensive in scope, as can 
be seen from the following description in the College catalogue: 

Inorganic Chemistry (5).—The nonmetallic elements; fundamen- 
tal principles of the science are studied in connection with the descrip- 
tive chemistry of nonmetallic elements and their compounds; prepares 
for future study of the science. Lecture 2 hours, recitation 2 hours, 
practicum 3 hours. 

The results obtained by Hessemer are presented in Table 3 
and show, as did Ball’s results, verbal ability to be the best 
single predictor of semester point average. The best positive 
correlation, + .18, was that of reasoning with Inorganic Chem- 
istry. This correlation is slightly larger than four times the 
size of its probable error. Interestingly enough, the largest 
correlation found was negative, — .25, for space with Inorganic 
Chemistry. When each of the primary abilities was correlated 
with semester point average, the highest correlation obtained 
was that of + .44 with verbal ability, while reasoning correlated 
next highest, + .40. 

In 1941 Tredick (13) conducted a rather thorough study 
of the predictive possibilities of the Thurstone Primary Mental 
Abilities Tests and a battery of vocational guidance tests. Her 
guidance test battery consisted of the Otis, the Pressey, the 













TABLE 3

The Primary Mental Abilities Correlated with Semester Point Average in Chemistry
School and Inorganic Chemistry

           Semester          Inorganic
           point average     Chemistry
P          -.008             -.22
N          +.23              +.09
V          +.44              +.13
S          -.08              -.25
M          +.36              +.05
I          +.25              -.02
D          +.40              +.18





Minnesota Paper Form Board, the Minnesota Clerical Test, 
the Meier-Seashore Art Judgment Test, the Minnesota Assem- 
bly, and the Minnesota Spatial Relations Test. Her subjects 
consisted of 113 freshmen women in the Department of Home 
Economics at The Pennsylvania State College. After adminis- 
tering these tests she correlated the seven abilities with the 
grades made by her subjects on the five college courses of Art, 
Chemistry, English Composition, Home Economics 101, Home 
Economics 109, and first-semester Point Average. Her find- 
ings are shown in Table 4. 

It will be observed from Table 4 that the range of correla- 
tions is from — .02 for memory correlated with Art, to + .55 for 
verbal ability correlated with English Composition. It will 
be noted that Tredick’s results, like Ball’s and Hessemer’s, 
showed verbal ability correlated more highly with each of the 
courses than did any of the other abilities. The highest any of 


TABLE 4

Correlations of the Primary Mental Abilities with College Grades and Semester
Point Average in Home Economics

                              English     Home        Home        Semester
       Art     Chemistry    Composition   Economics   Economics   point
                                          101         109         average
P      .15     .20          .19           .31         .18         .28
N      .11     .46          .22           .20         .33         .41
V      .24     .28          .55           .50         .37         .51
S      .25     .23          .10           .22         .16         .28
M     -.02     .25          .08           [?]         .15         .20
I      .26     .37          .19           .35         .26         .40
D      .21     .43          .21           .24         .33         .42

[?] marks an entry that is illegible in the source.













these abilities correlated with Semester Point Average was + .51 
and again it was verbal ability. 

Tredick then calculated the correlations of the battery of 
vocational guidance tests taken by her subjects with the same 
five college courses which were used to obtain the correlations 
with the Thurstone tests, and semester point average. The 
findings are shown in Table 5 and afford some interesting comparisons
with those obtained with the Thurstone Abilities Tests.


TABLE 5

Correlations of the Vocational Guidance Battery with College Grades and Semester
Point Average in Home Economics

                                                  English     Home        Home        Semester
                              Art    Chemistry  Composition   Economics   Economics   point
                                                              101         109         average
Paper Form Board              .24    .24        .13           .34         .17         .31
Pressey                       .20    .38        .48           .41         .43         .51
Otis                          .17    .37        .54           .48         .41         .53
Name Checking                 .07    .26        .07           .08         .17         .23
Number Checking               .08    .31        .27           .26         .26         .36
Meier-Seashore Art Judgment   .20    .03        .29           .34         .16         .23
Minnesota Assembly            .17    .16       -.01           .06         .01         [?]
Spatial Relations             .20    .22        .02           .27         .09         .23

[?] marks an entry that is illegible in the source.





The highest correlation obtained for a single test of the 
vocational guidance battery with college grades was that of the 
Otis (Higher Examination, Form A), which correlated +.54 
with English Composition. The next highest correlations ob- 
tained were for the Otis with Home Economics 101, + .48, and 
the Pressey with English Composition, + .48. It will be noted 
that both of the group tests for mental ability, the Otis and 
Pressey, correlate higher with these college courses than do any 
of the other tests of the vocational guidance battery. Further- 
more, the Otis correlates more highly with Semester Average, 
+ .53, than do any of the other tests of the vocational guidance 
battery. The Pressey correlates only slightly lower than the 
Otis with Semester Average, +.51. Comparing Table 4 with 
Table 5, it will be seen that the verbal factor is the only ability 



















that appears to correlate as well as the Otis or the Pressey Tests 
with the various courses. The Otis correlates slightly higher, 
+ .53, than does verbal ability, + .51, with semester point aver- 
age. However, the correlation with semester point average of 
the Pressey is the same, + .51, as that for verbal ability. 

In an effort to determine the relationship between her subjects’
scores on the Primary Abilities Tests and their scores on
the vocational guidance battery, Tredick calculated the corre- 
lations for each of the tests. Her results are shown in Table 6. 


TABLE 6

Correlations of the Thurstone Primary Mental Abilities with A Vocational Guidance
Test Battery

                                      P      N      V      S      M      I      D
Otis                                 .48    .38    .76    .40    .27    .52    .61
Pressey                              .53    .33    .68    .40    .29    .60    .61
Revised Minnesota Paper Formboard    .39    .21    .24    .37    .06    .48    .45
Minnesota Assembly                   .23   -.12    [?]    .34    .07    .26    .30
Spatial Relations                    .55    .15    .16    .49    .06    .47    .33
Number Checking                      .51    .59    .06    .36    .20    .28    .28
Name Checking                        .57    .58    .40    .41    .24    .44    .46
Meier-Seashore Art Judgment          .33    [?]    .39    .20    .18    .23    .15

[?] marks an entry that is illegible in the source.





The data of Table 6 offer some indications why the correla- 
tions of Thurstone’s verbal ability with the five college courses 
are so similar in size to those correlations obtained for the 
Pressey and Otis with the same college courses. Of the seven 
abilities, verbal ability has the highest correlation of the entire 
table, + .76, with the Otis Test. The next highest correlation, 
+ .68, also involves verbal ability with the Pressey. It would 
appear from the correlations that both the Pressey and the Otis 
contain materials similar to those found in Thurstone’s verbal 
ability. It also seems that the materials of the reasoning, in- 
duction, and the perceptual factors overlap the materials of 
the Otis and Pressey Tests, as can be seen from the correlations. 
It may be that the perception ability is related to the speed 
factor found to operate in timed tests such as the Otis and
Pressey. Memory ability correlates only slightly with the Otis
and the Pressey. Induction and reasoning abilities correlate 
highest with the Revised Minnesota Paper Form Board. The 
primary abilities appear to correlate only slightly with the 
Minnesota Assembly Test. The Minnesota Spatial Relations 
Test correlates best of all with perception ability, but space and 
induction also appear to overlap to some degree with this test. 
On the Minnesota Clerical Test, Name Checking correlates 
highest with the perception and number abilities. On Number 
Checking, the five abilities of verbal, space, memory, induction, 
and deduction do not appear to be related. On the other hand, 
four of these five abilities, with the exception of memory, appear 
to be related. 

Finally, Tredick combined the abilities of number, verbal,
induction, and deduction, using as her criterion semester point
average, and obtained an R of +.61. Similarly, she combined
the Otis, Pressey, Minnesota Paper Form Board, and Number
Checking of the vocational guidance battery, using as her cri-
terion semester point average, and obtained an R of +.57.

Goodman (7) made a second attempt in 1940 to determine
the predictive value of the Abilities Test in determining college
success for his engineer subjects at the completion of their first
year in college. Each of the factors was correlated with the
first year’s semester point average. The results are shown in
Table 7.


TABLE 7

Correlations for Each of the Abilities with the First Year's Semester Point
Average in Engineering

Ability                        P      N      V      S      M      I      D

Coefficient of Correlation   +.08   +.26   +.34   +.18   +.11   +.30   +.36

The abilities of reasoning, induction, verbal, space, and 
number were then used in a multiple correlation with first 
year’s semester point average and the combined variables 
yielded a multiple correlation of +.49, which is slightly lower 
than that of + .51 obtained for the first semester. 

Goodman (7) then sought to determine whether he could
obtain better predictive values for his engineer subjects by
using the sixteen Thurstone tests individually. These tests, 
when combined, yield the score that is the measure for the abil- 
ity. According to Thurstone (14), each of the tests designed 
to measure the particular ability is highly saturated with that 
ability. The correlations between the tests that measure these 
abilities are shown in Table 8. The tests of the three abilities 
of number, verbal, and space appear to correlate highly with 
each other, while the correlations of the tests of the other abili-
ties descend in size.


TABLE 8

Tests Measuring Each of the Abilities and the Correlations for the Tests of
Each Ability

                                          Coefficient of
Factor   No.   Test                       correlation          P.E.

P         1    Identical Forms            +.32                 .05
          2    Verbal Enumeration
N         3    Addition                   +.72                 .02
          4    Multiplication
V         5    Completion                 +.61                 .03
          6    Same-Opposites
S         7    Cards                      +.63                 .03
          8    Figures
M         9    Word Number                +.35                 .05
         10    Initials
I        11    Letter Grouping            +.28 (11 & 12)       .05
         12    Marks                      +.33 (11 & 13)       .05
         13    Number Patterns            +.22 (12 & 13)       .05
D        14    Arithmetic                 +.44 (14 & 15)       .04
         15    Number Series              +.15 (14 & 16)       .05
         16    Mechanical Movements       +.01 (15 & 16)       .05





The abilities tests were then correlated with the first year’s
semester point average and the zero-order coefficients were ob-
tained. For the purpose of comparison, the zero-order coeffi-
cients for the tests and the ability zero-order coefficients are
given in Table 9. The following facts are to be noted: both
tests of the perceptual factor correlate only slightly with the
criterion as does the factor itself, but it can be seen that the
Identical Forms Test alone correlates higher than the composite
of the ability tests, while the correlation for the Verbal Enumer-
ation Test is low in value. In number ability the r for the
single test of Addition is as high as the r for the factor itself and
the Multiplication Test r is not much smaller. In the verbal
ability, the Completion Test r is again almost as high as the abil-
ity r. In space, the correlation for the test of Cards alone is
greater than that for the ability, while the r for the second
test of Figures is of low value. Both tests of memory cor-
relate very slightly with the criterion while the r for the ability
itself is slightly greater than the r for the Word Number Test.
The three tests of induction have r’s of approximately the same
size, while the r for the ability is somewhat larger than that for
any of the tests. In reasoning, the r for the test of Arithmetic
is almost as large as the r for the ability, with the Number Series
Test r slightly smaller. The test of Mechanical Movements
correlates slightly with the criterion and with the other two
tests of the ability.


TABLE 9

Zero-Order r's of the Sixteen Thurstone Tests with the Criterion

                                        Factor r with     Test r with
Factor    Test                          criterion         criterion

P         Identical Forms               +.08              +.09
          Verbal Enumeration                              +.03
N         Addition                      +.26              +.26
          Multiplication                                  +.24
V         Completion                    +.34              +.31
          Same-Opposites                                  +.30
S         Cards                         +.18              +.23
          Figures                                         +.10
M         Word Number                   +.11              +.10
          Initials                                        +.08
I         Letter Grouping               +.31              +.26
          Marks                                           +.20
          Number Patterns                                 +.24
D         Arithmetic                    +.36              +.32
          Number Series                                   +.32
          Mechanical Movements                            +.10


Upon obtaining the zero-order coefficients between the tests
and the criterion, a multiple correlation of eleven of the tests
was computed. The eleven tests selected were those that
correlated highest with the criterion and lowest with the other
tests. They were Arithmetic, Number Series, Same-Opposites,
Completion, Letter Grouping, Addition, Marks, Number
Patterns, Cards, Multiplication, and Word Number. The
multiple correlation obtained, using these eleven tests, was +.48.

It was apparent that there was little hope of raising the 
multiple coefficient by adding the remaining five tests, since 
they correlated very little with the criterion. It will be noted 
that with the eleven tests the correlation was no greater than 
that obtained by using only five of the abilities. 
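
Why additional tests add so little can be illustrated with the two number tests. The following is a minimal sketch and not part of Goodman's analysis: it applies the standard two-predictor formula for a multiple correlation to the Addition and Multiplication coefficients of Table 9 and to their intercorrelation in Table 8, on the assumption that both tables describe the same group of subjects. The function name is illustrative only.

```python
import math


def multiple_r_two_predictors(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors.

    r1, r2 -- zero-order correlations of each predictor with the criterion
    r12    -- correlation between the two predictors
    """
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)


# Addition (+.26) and Multiplication (+.24) with the criterion, from Table 9;
# their intercorrelation (+.72) from Table 8. Assumed to come from one sample.
r_combined = multiple_r_two_predictors(0.26, 0.24, 0.72)

# Hypothetical contrast: the same two coefficients if the tests were uncorrelated.
r_if_uncorrelated = multiple_r_two_predictors(0.26, 0.24, 0.0)

print(round(r_combined, 3), round(r_if_uncorrelated, 3))
```

Under these assumptions the two tests together yield an R of about .27, scarcely above the +.26 of Addition alone, because the tests correlate +.72 with each other. Had they been uncorrelated (a hypothetical case), the same two coefficients would give about .35.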

White (17) in 1942 made a study of the Thurstone tests 
and college prediction in Home Economics, using 94³ of the
same 113 subjects used by Tredick. At the time Tredick con- 
ducted her study these subjects were freshmen and had com- 
pleted their first semester’s work. When White conducted her 
study, the 94 subjects she used of Tredick’s group were then 
sophomores. White combined the grades for all of the indi- 
vidual courses taken by her subjects and took as the represen- 
tative score the students’ average for each of the particular 
types of courses. Semester point average in White’s study 
represents the average of two years of college work by her sub- 
jects. Correlations were then obtained between each of the 
primary abilities and the average score for work in Art, Science, 
Home Economics, and semester point average. The results 
are shown in Table 10. 

It is worth comparing the results obtained by White as 
shown in Table 10 with those of Tredick in Table 4. White’s 
data showed that each of the abilities correlated lower in all 
cases but one with semester point average than those obtained 
by Tredick. The exception is reasoning ability, which corre- 
lates + .45 with semester point average and indicates an increase 
over the correlation of + .42 obtained by Tredick. 

The correlations in White’s study for the abilities with Art
average are generally higher than those found by Tredick.
For Science average the r’s of the abilities calculated by White
are mixed, some being higher and others lower than those
Tredick obtained for the abilities with Chemistry. White’s
correlations are all higher with English average than those cal-
culated by Tredick, with the exception of induction, which is
slightly higher for Tredick. All of White’s r’s with Home Eco-
nomics average are lower, with the one exception of induction,
when compared with Tredick’s correlations for Home Economics
101 and 109. Another fact worth noting is that in White’s
study all of the correlations of reasoning ability are higher than
those obtained by Tredick. It may be that reasoning ability
becomes more important when, presumably, the degree of diffi-
culty of college work increases. One more fact worth noting
is that White’s correlation of verbal ability with English aver-
age was +.65, while Tredick obtained an r of +.55 for verbal
ability with English Composition.

³ Actually, the subjects are the same in each study. However, nineteen of the
original group used by Tredick had dropped out of college, and since data were not
available for them, White was unable to include them in her study.


TABLE 10

Correlations of the Sixteen Primary Mental Abilities Tests and the Seven Abilities
with Average Grades in Home Economics Courses

                               Semester    Art         Science     English     Home
                               average                                         Economics
Ability  Test                   1*    2†    1     2     1     2     1     2     1     2

P        Identical Forms        .05        [illegible]  .08        -.04         .02
         Verbal Enumeration     .26   .19   .10   .13   .23   .18   .38   .20   .17   .11
N        Addition               .33        [illegible]  .44         .30         .19
         Multiplication        [entries illegible]
V        Completion             .48         .29         .32         .59         .33
         Same-Opposites        [entries illegible]
S        Cards                  .12         .20         .14         .07         .04
         Figures               [entries illegible]
M        Initials               .20         .14         .23         .28         .05
         Word Number            .15   .20   .04   .11   .26   .28   .17   .26  -.03   .02
I        Letter Grouping        .28         .23         .26         .23         .21
         Marks                  .12   .24   .14   .25   .12   .23   .14   .18   .06   .14
         Number Patterns        .20         .23         .20         .14         .09
D        Arithmetic             .43         .23         .51         .27         .35
         Number Series          .42   .45   .29   .30   .48   .49   .31   .30   .27   .36
         Mechanical Movements   .10         .10         .04         .06        [illegible]

* Column 1 indicates r between tests and course.
† Column 2 indicates r between ability and course.

White then correlated each of the sixteen Thurstone tests 
with the average for each of the four courses and semester point 
average. These results are also shown in Table 10. Twenty-two
per cent of the correlations for the single tests in Table 10
are higher than, or equal to, the correlations for the abilities. 
In many cases the correlations of the tests of the ability are 
only slightly smaller than the correlation for the ability. Finally,
White computed multiple correlations, using first, combina- 
tions of the abilities, and secondly, the individual tests. When 
she combined number, verbal, memory, induction, and reason- 
ing abilities with semester point average, she obtained a mul- 
tiple correlation of +.59. Combining number, verbal, and 
reasoning abilities and correlating them with semester point 
average, she also obtained a multiple correlation of +.59. 
Lastly, she combined the five individual tests of Completion, 
Same-Opposites, Arithmetic, Number Series, and Addition, and 
obtained a multiple correlation of + .62. 


Other Thurstone Studies 


A number of other studies have been reported on the pre- 
dictive possibilities of the Thurstone Primary Mental Abilities 
Tests. Yum (19), in a study at the University of Chicago, cal-
culated multiple correlations using various combinations of the
abilities with semester average. The best multiple correlation 
he obtained was + .422 using all of the abilities. Shanner and 
Kuder (10) have reported correlations of the abilities with 
average grades for the 1938 freshman class of the University 
of Chicago. The highest correlation reported for average 
grades was with verbal ability, +.415. Correlations are also 
reported by these writers for the abilities with four introduc- 
tory courses at the University of Chicago. Deduction corre- 
lated with Biological Science, + .418; verbal ability, + .472 with 
Humanities; deduction, + .485 with Physical Sciences; deduc-
tion, + .427 with Social Sciences. Multiple correlations using
all seven of the abilities with the four introductory courses 
yielded the following R’s: + .500 with Biological Sciences, + .541 
with Humanities, + .561 with Physical Sciences, + .566 with 
Social Sciences. 

In 1941 Ellison and Edgerton (6) tested a group of 49 stu- 
dents at Ohio State University with the Thurstone Primary 
Mental Abilities Tests. The highest correlation obtained,
+.44, was for verbal ability with point hour ratio. The remain-
ing ability correlations with point hour ratio range from -.24
to +.31. Combining the seven abilities, they obtained an R 
of +.640 with point hour ratio. Ellison and Edgerton also 
report correlations for the abilities with grades in college sub- 
jects. The highest correlations reported were: verbal ability 
with English, + .75; verbal ability with Science, + .68; induction 
with Foreign Language, + .78; reasoning with Psychology, + .63. 
They state “that the results can only be taken as suggestive 
and not as facts from which broad generalizations may be 
drawn.” 
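
The caution expressed by Ellison and Edgerton is understandable when seven predictors are fitted to only 49 students. As a rough illustration, and not a computation reported by those authors, the sketch below applies the common adjusted R-squared correction for sample size and number of predictors to their R of +.640. The function name is hypothetical.

```python
import math


def adjusted_multiple_r(r_multiple, n_subjects, n_predictors):
    """Multiple R after the usual adjusted R-squared correction.

    adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1)
    """
    r_squared = r_multiple ** 2
    adjusted = 1 - (1 - r_squared) * (n_subjects - 1) / (n_subjects - n_predictors - 1)
    # A negative adjusted R^2 is conventionally read as no relationship.
    return math.sqrt(max(adjusted, 0.0))


# Figures from the text: R = +.640, 49 students, seven abilities combined.
shrunken_r = adjusted_multiple_r(0.640, 49, 7)
print(round(shrunken_r, 3))
```

On these assumptions the corrected value is roughly .56, appreciably below the reported coefficient.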

On the basis of the studies reported in this paper, the fol- 
lowing conclusions appear to be justified: 

1. The Thurstone Primary Abilities Tests correlate, on the 
whole, as well as most standardized intelligence tests with cri- 
teria of college success. 

2. The Thurstone Primary Abilities correlate with indi- 
vidual college courses to some degree and can be used for pre- 
diction of success in these courses. 

3. Verbal ability correlates higher than any other of the 
abilities with semester point average and individual college 
courses. 

4. Multiple correlations obtained by using various combina- 
tions of the primary abilities yield some increase in correlation 
over those obtained by using single abilities, when correlated 
with semester point average. 

5. Multiple correlations using various combinations of the 
single tests that measure the primary abilities yield correlations 
with semester point average that were in some cases higher than 
or equal to those obtained using the abilities. 

6. Verbal ability correlates highly with the Otis and Pressey 
tests, and appears to be overlapping some of the functions in 
these general intelligence tests. 

7. The Otis and Pressey tests appear to contain some of the 
same functions as those measured by the seven Thurstone pri- 
mary abilities. 

8. The single tests measuring an ability in some instances 
correlated highly with each other. 




9. A single test of an ability will in some instances correlate
higher with the criterion than does the composite of several
tests of the ability itself.


REFERENCES 


1. Ball, F. J. A Study of the Predictive Values of the Thurstone
    Primary Mental Abilities as Applied to Lower Division
    Freshmen. The Pennsylvania State College, 1940. (Un-
    published thesis.)

2. Bernreuter, R. G. The Personality Inventory. Stanford Uni-
    versity: Stanford University Press, 1931.

3. Bernreuter, R. G. “Primary Ability Tests Applied to Engineer-
    ing Freshmen.” Psychological Bulletin, XXXVI (1939),
    548.

4. Bernreuter, R. G. and Goodman, C. H. “A Study of the Thur-
    stone Primary Mental Abilities Tests Applied to Freshmen
    Engineering Students.” Journal of Educational Psychology,
    XXXI (1941), 55-60.

5. Chein, I. “An Empirical Study of Verbal, Numerical and
    Spatial Factors in Mental Organization.” Psychological
    Record, III (1929), 71-94.

6. Ellison, M. L. and Edgerton, H. A. “The Thurstone Primary
    Mental Abilities and College Work.” Educational and Psy-
    chological Measurement, I (1941), 399-406.

7. Goodman, Charles H. Ability Patterns of Engineers and Success
    in Engineering School. The Pennsylvania State College,
    1941. (Unpublished thesis.)

8. Hessemer, Marianne. The Thurstone Primary Mental Abilities
    Tests in a Study of Academic Success in the School of Chem-
    istry and Physics. The Pennsylvania State College, 1942.
    (Unpublished thesis.)

9. Schneck, M. M. R. “The Measurement of Verbal and Numer-
    ical Abilities.” Archives of Psychology, XVII (1929), No.
    107.

10. Shanner, W. M. and Kuder, G. F. “A Comparative Study of
    Freshman Week Tests Given at the University of Chicago.”
    Educational and Psychological Measurement, I (1941),
    85-92.

11. Stalnaker, J. M. “Primary Mental Abilities.” School and
    Society, L (1939), 568-572.

12. Strong, E. K. The Vocational Interest Blank. Stanford Uni-
    versity: Stanford University Press, 1938.

13. Tredick, Virginia D. The Thurstone Primary Mental Abilities
    Tests and a Battery of Vocational Guidance Tests as Pre-
    dictors of Academic Success. The Pennsylvania State Col-
    lege, 1939. (Unpublished thesis.)

14. Thurstone, L. L. Manual of Instructions. Washington, D. C.:
    American Council on Education, 1938.

15. Thurstone, L. L. Primary Mental Abilities. Psychometric
    Monographs, No. 1, Chicago: University of Chicago Press,
    1938.

16. Thurstone, L. L. “Current Issues in Factor Analysis.” Psy-
    chological Bulletin, XXXVII (1940), 189-236.

17. White, Elizabeth W. The Use of Certain Tests in the Prediction
    of College Success As Applied to the School of Home Eco-
    nomics. The Pennsylvania State College, 1942. (Unpub-
    lished thesis.)

18. Wolfle, Dael. Factor Analysis to 1940. Psychometric Mono-
    graphs, No. 3, Chicago: University of Chicago Press, 1940.

19. Yum, K. S. “Primary Mental Abilities and Scholastic Achieve-
    ment in the Divisional Studies at the University of Chi-
    cago.” Journal of Applied Psychology, XXV (1941), 712-
    720.




















TEST CONSTRUCTION IN PUBLIC PERSONNEL 
ADMINISTRATION 


DOROTHY C. ADKINS 


Social Security Board 


Introduction 


When the public is persuaded that positions in the public
service should be filled by the best qualified persons and ex- 
presses its conviction through a civil service law, a tremendous 
responsibility, that of predicting which persons actually are the 
best qualified, devolves upon the agency charged with adminis- 
tering the law. Increased emphasis on the impartiality of the 
selection of public officials has been accompanied by growing 
reliance on examining processes that are objective. Hence, 
the major attention of this article is devoted to topics relating 
to the construction of the written examination in civil service. 
In this interpretation, problems common to the academic set- 
ting have been largely excluded. Although the article is further 
restricted to problems arising in state civil service or merit sys- 
tem jurisdictions, certain of the comments may apply equally 
to civil service at the federal level. 

However obscured it may be in practice, the essential of any 
merit examination is that it predict efficiency on the job. Those 
who are not likely to perform satisfactorily on the job should 
be excluded from the final list of eligibles, and those who achieve 
places on the register should be ranked in the order of predicted 
job performance. To these ends each part of the total examina- 
tion process should contribute. 

State civil service examinations often include, in addition to 
a written test, a rating of training and experience and an oral 
interview. If proficiency in the operation of machines or equip-
ment is essential, a performance test may be one of the com-
ponents. For some positions phases of the examination may
well be modified or omitted, but usually not the written 
examination. 


Rating of Education and Experience 


Most state jurisdictions determine initially who shall be 
admitted to examinations by prescribing minimum qualifica- 
tions of education and experience. Within limits, additional 
education may be substituted for a part of the experience, and 
vice versa. The process thus first screens out those candidates
who do not meet the minimum qualifications. Those barely
meeting the education and experience requirements may be as- 
signed the lowest passing score and those surpassing the mini- 
mum requirements higher ones, the scores depending upon such 
factors as amount, pertinency, and recency. The inclusion of 
ratings of training and experience as one part of the examina- 
tion process is based on the assumption that differences among 
the candidates on these factors will be reflected in job per- 
formance. 

The argument that the entire burden of the examination 
process should be placed on such ratings is untenable, however, 
since the rating procedure is relatively subjective and unreliable 
and fails to take cognizance of variations in the degree and ex- 
tent of knowledges and abilities that exist even among those of 
similar training and experience. Such variations often are 
of greater significance in the prediction of job performance than 
are differences among candidates in education and experience 
beyond the entrance requirements. Further, if these require- 
ments are relatively high, differentiation among candidates 
meeting them may, by means of disproportionate weighting of 
mere survival, give undue advantage to those who have failed 
to progress. 

Thus if, in relation to the salary level and labor market, 
minimum requirements for a professional class are low, say col- 
lege graduation, then assignment of a high education-experi- 
ence score to a candidate with two years of pertinent graduate 
work and three years of closely related experience probably 
would contribute to the validity of the total examination. But
if for a similar class entrance requirements are higher, say two
years of pertinent graduate education and three years of experi- 
ence, then giving a higher score to a candidate with five years of 
experience than to a candidate barely meeting the minimum 
qualifications may be of questionable value. The principal 
effect of such practice in this instance might be to give unwar- 
ranted higher scores to candidates who, although older, have 
not advanced beyond younger candidates also meeting the high 
requirements for the class. Moreover, the rating of training 
and experience does not provide an evaluation of personality 
differences.


The Oral Interview


The oral interview, despite its recognized limitations, seems 
to be the best available instrument for appraising personality 
characteristics. For civil service examinations little or no re- 
liance can be placed on paper-and-pencil tests of personality, 
which would fail to elicit frank answers in a competitive situa- 
tion with jobs at stake. Behavior in an oral interview can also 
be faked. For this reason, and because its ratings are unreliable,
the oral interview included in the final score is usually weighted
considerably less than are the other two parts, even though it
contributes positively to validity, and the number of candidates
failed on the oral is very small.
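
The weighting of the parts can be pictured as a weighted composite from which the register is ranked. The sketch below is illustrative only: the weights, the passing mark, the 0 to 100 score scale, and the candidates are all hypothetical and are not taken from any jurisdiction's rules. The one feature drawn from the text is that the oral interview carries less weight than the other two parts.

```python
from dataclasses import dataclass

# Hypothetical weights; the oral interview is weighted least, as the text describes.
WEIGHTS = {"written": 0.5, "training_experience": 0.3, "oral": 0.2}
PASSING_MARK = 70.0  # hypothetical passing composite on a 0-100 scale


@dataclass
class Candidate:
    name: str
    written: float
    training_experience: float
    oral: float


def composite_score(candidate):
    """Weighted sum of the three part scores."""
    return sum(weight * getattr(candidate, part) for part, weight in WEIGHTS.items())


def build_register(candidates):
    """Drop candidates below the passing mark and rank the rest, highest first."""
    eligible = [c for c in candidates if composite_score(c) >= PASSING_MARK]
    return sorted(eligible, key=composite_score, reverse=True)


# Hypothetical candidates for illustration.
CANDIDATES = [
    Candidate("Candidate B", written=70, training_experience=80, oral=95),
    Candidate("Candidate C", written=60, training_experience=70, oral=90),
    Candidate("Candidate A", written=90, training_experience=70, oral=80),
]

for entry in build_register(CANDIDATES):
    print(entry.name, round(composite_score(entry), 1))
```

With these invented figures a strong oral rating does not carry Candidate C onto the register, and Candidate A ranks above Candidate B on the strength of the written test.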

Warranting at least passing mention are certain miscon- 
ceptions of what constitute desirable purposes of an oral inter- 
view, such as the idea that its aim is to reappraise the appli- 
cant’s education and experience or that it should be designed to 
ascertain the scope of the candidate’s knowledge or the degree 
of his general intellectual abilities. Since such factors can be 
measured more adequately by the other parts of the examina- 
tion process, the inclusion of a relatively unreliable rating of 
them in the total score probably serves to lower not only the 
reliability of the total composite but, more critically, its 
validity. 

The Performance Test 


Where differences in skill in operating machines are a perti-
nent factor, the total examination usually includes a perform-
ance test. For positions requiring such skill, the oral inter-
view is of negligible importance if the duties typically entail
little contact with the public or with fellow employees. Be- 
cause of this and not because they serve similar purposes, per- 
formance tests and oral interviews are rarely used for the same 
class. Differential ratings of training and experience are also 
commonly omitted for machine-operator classes. Performance 
tests are costly, difficult to administer, time-consuming, and 
frequently not very reliable. But their use may add appre- 
ciably to the predictive efficiency of the composite. In view 
of their limited reliability and their failure to take adequately 
into account the ability to profit from further training and 
experience, they are more appropriately regarded as a supple- 
ment to rather than a substitute for a written test. 


The Written Test 


Test Content. Even though the other parts of the total 
examination process may be so important for certain classes 
that they should not be dispensed with, no other is so significant 
or should bear so much weight as a well constructed written 
examination. The purpose of the written test is to determine 
reliably the extent of individual differences in pertinent areas 
of knowledge and abilities. Defining areas of knowledge and 
the abilities to be sampled in an examination is achieved by 
means of job analysis, the results of which are commonly sum- 
marized in a class specification. Such a compendium, depend- 
ing upon its thoroughness and clarity, gives a picture of the 
duties of the class of positions, qualifications in terms of 
education and experience, the supervision exercised and re- 
ceived, and the knowledges and abilities bearing on the duties 
of the class. This information should be supplemented by 
knowledge of the relationship of the class to the total organiza- 
tion and of the applicable salary range. First-hand acquain- 
tance with the job, although unfortunately not always avail- 
able, is of inestimable value. Since the written examination 
must contribute to the prediction of efficiency on a particular 
type of job, its most effective construction is contingent upon a 
clear idea of what the job is. Its subject matter should first of
all be related to the prediction of job performance; and the
weights to be assigned different areas of subject matter should 
depend upon their contribution to the prediction of efficiency on 
the job. 

The types of subject matter included in a test have all too 
often been limited by lack of facilities essential for item con- 
struction. Administrators and test technicians place too little 
reliance on persons thoroughly familiar with the subject matter. 
Test technicians sometimes delude even themselves into believ- 
ing that they can construct or at least assemble an adequate 
examination in a specialized field without assistance. Much of 
the criticism directed at written tests is attributable to action 
based upon just such misguided self-confidence. On the other 
hand, examining agencies should not vest sole responsibility 
for either item construction or assembling examinations in 
subject-matter consultants who are not skilled in examination 
techniques. The sounder and more successful approach as- 
sumes collaboration between those schooled in content and 
those versed in technique, although an occasional agency may 
be fortunate enough to have on its staff a person thoroughly 
competent in both a subject-matter field and examining pro- 
cedures. 

The content of the written examination should be related to 
realistic class specifications. It is by necessity limited by the 
facilities for constructing or securing items. A third factor 
that cannot be ignored in its bearing on examination content 
is public opinion. Almost universally, civil service examina- 
tions are required to be “practical and related to the job,” al- 
though the exact statement of the criterion varies. It would 
seem at first thought that establishment of a “satisfactory” 
relationship between test scores and performance on the job 
would guarantee the meeting of this criterion—and the merits 
of such an argument will not be denied. Pertinent to this con- 
tention, however, is the difficulty of establishing that the corre- 
lation of an examination with job performance is “satisfactory,” 
particularly in such a way that the public will accept the 
demonstration as proof of the “practicality” of the examination. 
Faced with this dilemma, the majority of civil service agencies
interpret the term practical to mean what they think the public
means by practical. A “practical” examination, then, becomes 
one that looks practical to the lay person, and, particularly, to 
the candidate. An examination may of course be satisfactorily 
valid from the point of view of its correlation with job perform- 
ance and yet fail to meet the criteria for “face validity” of 
this type. In the interest of fostering public support of the 
civil service principle, construction of examinations efficient in 
predicting job performance and also acceptable to the public 
as practical is the desideratum. An examining body that limits 
itself to the kind of face validity under discussion is not in any 
sense adequately fulfilling its function; nevertheless, its general 
aim will in the long run be appreciably advanced if it achieves 
discriminating examinations that, at the same time, have this 
type of validity. 

Rigorous application of this criterion of practicality requires 
not only careful scrutiny of the broad areas of knowledge and 
abilities sampled in the test as a whole, and of the general area 
covered in each item, but also attention to the several in- 
dividual concepts contained in each item. Sometimes attention 
must be focused on the individual word. In a test composed 
of multiple-choice items, for example, the candidate may not 
recall merely the question or just the question and the best 
answer—obviously, he may not know which answer is intended 
to be best. But he may remember and criticize some of the 
answers intended to be “distracters,” “decoys,” or “confusions,” 
as they are variously termed. If an item constructor inno- 
cently includes Socrates in a distracter, he should not be sur- 
prised if the examination is later publicized as absurdly 
impractical because it inquires into the candidates’ Greek 
philosophy. Particularly in the case of civil service agencies 
not far removed from the publics they serve, every examination 
item should be reviewed from the standpoint of how it might 
look in the public prints. 

Speed versus Power. Public opinion also has important 
bearing on other aspects of the examining process. For ex-
ample, it is more favorable to a power test than to a test that
places a premium on speed. People say job duties are not performed
in a setting that emphasizes speed and competition.
These in turn, they say, create anxiety and thus poor perform-
ance. The public subscribes to the belief that “accuracy is
more important than speed” without recognizing the corollaries
that for many jobs rapid accuracy is preferable to slow accuracy, and
that speed and accuracy, at least in relatively simple tasks,
tend to be positively correlated.

Many candidates, and particularly those who have failed 
or attained low ratings on a time-limit test, share the layman’s 
preference for work-limit tests. Some may believe that speed 
tests unduly penalize the older, more experienced worker. The 
fallacy in this argument is twofold: first, although speed of per- 
formance of some functions, as measured by tests, probably 
does decrease somewhat with age, in reality the speed of the 
performance of the same candidate at different ages (within 
the usual age range of candidates but excluding the age of 
senility) does not usually differ markedly; and, second, if the 
speed of performance of particular functions did decrease ap- 
preciably with age, then, for those positions for which speed of 
performance of these functions is important, reflection of this 
decrease in the test score would increase the validity of predic- 
tion. Actually, comparisons referred to by candidates are usu- 
ally not based upon the performance of the same candidates at 
different age levels but rather upon that of persons who differ 
not only in age but also in basic abilities and who still would 
differ in abilities even if they were of the same age. The older 
candidates may get lower scores on a speed test and thus appear 
to be penalized. The score is not a result of age but an indica- 
tion of lesser ability. Reflect that most older candidates for a 
job at a level for which a speed test is frequently given have 
worked for a number of years without progressing as far as jobs 
which will be beginning jobs for the younger candidates. Thus, 
among candidates for clerical positions, those aged 55 do not 
represent the same kind of sample of 55-year-olds as those aged 
19 represent of 19-year-olds. 

Regardless of the false premises underlying this predilection 
for power tests, probably the best forecasts of performance in 
the higher-level positions will result from tests with time limits
so liberal that only habitual laggards will have difficulty in at- 
tempting all the items within the time allotted. For the major- 
ity of clerical positions, however, where speed of performance 
clearly is an important component of job proficiency, achieve- 
ment of the best possible prediction now will in the long run 
outweigh any advantage that might be gained by a concession 
to public opinion at the moment. The inclusion of a speed test 
as one component of an examination, even for a clerical class, 
may wisely be accompanied by a campaign to educate the in- 
terested public on the reasons for its use. 

Length. Every person who has ever attended school has 
opinions about examinations. Thus public opinion impinges 
on the proper length of a test. The school examination, taking
from 30 to 50 minutes, is regarded as typifying the most trying 
ordeal to which a candidate should be subjected. It is forgot- 
ten that normally the results of any one such school examina- 
tion are considered along with the results of several others, 
daily observation, and the appraisal of performance on many 
additional assignments; and, more important, that such an ex- 
amination is designed to measure achievement in a relatively 
limited area of a single subject-matter field. In contrast, the 
civil service written test is supplemented, if at all, only by a 
rating of education and experience and possibly by a brief oral 
interview or a performance test; and it is designed to sample a 
much greater complexity of knowledges and abilities. 

When candidates clearly realize this distinction, when they 
understand why a test that adequately samples the major fields 
relating to a job is more reliable and more valid than one re- 
stricted in coverage, they prefer tests sufficiently long to yield 
a reasonably just appraisal of their job potentialities. Al- 
though candidates may complain that long examinations are 
endurance contests, experimental work on prolonged mental 
effort indicates that the affect is more potent than the effect. 
Unfortunately, some civil service jurisdictions capitulate to 
public pressures for short examinations instead of enlightening 
the public on the virtues of comprehensive sampling. 
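
As an illustration of why the longer test is more reliable, the standard Spearman-Brown formula (not cited in this article, and assuming that the added items are parallel to those already in the test) gives the reliability $r_{kk}$ of a test lengthened $k$ times from the reliability $r_{11}$ of the original:

```latex
r_{kk} = \frac{k \, r_{11}}{1 + (k - 1) \, r_{11}}
```

With a purely hypothetical reliability of .60 for a 40-minute test, doubling the length ($k = 2$) would yield $1.20 / 1.60 = .75$.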

Type of Item. Decision on the type of item best adapted 
to civil service tests should be weighed against adequacy of
coverage of the pertinent areas of knowledge and abilities, 
objectivity of scoring, and ease of administration. All things 
considered, the use of a large number of objective items is in
general preferable to reliance on an essay examination. In 
civil service, objective tests have largely superseded essays. In 
agencies that still use them, the tendency towards supplementa- 
tion by the objective type increases. 

If broad, general essay questions are used in an effort to 
cover particular fields of subject matter, different candidates 
may not actually be answering the same question, a factor that 
may nullify attempts to place them in an order of merit. If, on 
the other hand, essay questions are more pointed and limited 
in scope, the coverage of the requirements is automatically 
restricted, although more reliable grading can be achieved. 
The objective test not only has the advantage of permitting a 
broader sampling of pertinent knowledges and abilities than the 
essay test but also, if properly used, will almost certainly lead 
to markedly greater reliability in scoring. These are highly 
important advantages in civil service, where areas of subject 
matter to be covered are broad and where unreliability of scor- 
ing may have a greater effect upon human destinies than in 
almost any other field of testing. 

Test scoring should be objective and also as simple as pos- 
sible in the interests of efficiency. Wherever feasible, the same 
kind of item should be used throughout and the same weight 
assigned to each item. The multiple-choice form lends itself 
admirably to most purposes. Many forms that appear to differ 
are simply variants of this form. The argument that use of a 
variety of forms adds interest to a test is insignificant, since the 
competitive setting supplies more than enough motivation. 
Arousing interest in a test per se is an empty gesture. Items, 
and even the several responses to an item, are sometimes dif- 
ferentially weighted in civil service tests, but there is growing 
awareness that unweighted and weighted scores for a large num- 
ber of items correlate so highly that little is to be gained by 
differential weighting. 
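
The point about differential weighting can be checked with a small simulation. The sketch below is only illustrative: the number of candidates, the range of item weights (1 to 3), and the logistic model for a correct answer are all assumptions introduced here, not figures from any civil service test.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_candidates=500, n_items=100, seed=0):
    """Correlate unit-weighted totals with differentially weighted totals."""
    rng = random.Random(seed)
    # Hypothetical item weights and difficulties (assumptions for illustration).
    weights = [rng.uniform(1.0, 3.0) for _ in range(n_items)]
    difficulties = [rng.uniform(-1.5, 1.5) for _ in range(n_items)]
    unweighted, weighted = [], []
    for _ in range(n_candidates):
        ability = rng.gauss(0.0, 1.0)
        # 1 if the candidate answers the item correctly, else 0.
        answers = [
            1 if rng.random() < 1.0 / (1.0 + math.exp(-(ability - d))) else 0
            for d in difficulties
        ]
        unweighted.append(sum(answers))
        weighted.append(sum(w * a for w, a in zip(weights, answers)))
    return pearson(unweighted, weighted)


r = simulate()
print(f"correlation between unweighted and weighted totals: {r:.3f}")
```

Under these assumptions the two sets of totals rank candidates almost identically, which is the practical argument for assigning the same weight to each item.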

Use of items of a single, readily understood type, accompanied
by clear instructions to candidates and with one over-all
time limit, simplifies test administration and hence may
contribute positively to test validity. This is one way of over- 
coming the handicap of monitors poorly trained or too unstable 
to meet emergencies with poise and skill. 

Repeated Use of Items. In the educational world test 
questions are sometimes used more than once, to the gratifica- 
tion of certain fraternities that keep files of questions available. 
Scores tend to be higher on the second administration of a test. 
Perhaps in college repeated use is not very serious, since a course 
mark usually does not depend on one examination alone and 
since a mark in a single course may be of no great moment. 
Institutions that are placing dependence on comprehensive 
examinations and little if any on course marks, however, are 
increasingly following the practice of using test questions only 
once. The problem of maintaining the confidential nature of 
examination materials is even more urgent for civil service 
jurisdictions. There may be organized efforts on the part of 
“cram schools” to obtain access to examination items used; 
some candidates may apply for the sole purpose of memorizing 
assigned portions of an examination. In the absence of such 
studied attempts to sabotage a merit system, candidates never- 
theless may remember a few items in detail and another group 
of questions to “look up.” Whether or not this factor would 
give any odds to these candidates or their friends if the same 
items were repeated, some members of the public might think 
there would be an advantage. Thus the confidential nature 
of the examination fuses with the question of public relations. 
A civil service agency better maintains the support of the pub- 
lic if it can give assurance that no appreciable advantage could 
accrue to any candidate because of previous use of items. 

Statistical Analyses of Items. From the point of view of 
the reliability and validity of a test, inclusion of a few poor 
items is negligible if the test as a whole is long enough. In 
one way, however, the problem created by indefensible items is 
greater in civil service tests than in almost any other kind. 
This is true because a difference of one point may affect a candidate’s
rank order and also may determine his passing or failing.
Hence he may or may not get a job. Since a difference
of even one point has this importance, great care should be 
exercised to exclude from a test items that might have a nega- 
tive validity. The problem is of consequence, too, as it relates 
to public faith in the merit system. A widely heralded appeal 
based on a few weak items, which may really be of no great 
moment insofar as they affect the over-all reliability and valid- 
ity of measurement, can go a long way toward undermining 
public support of the merit principle. 

In speaking of the validity of a civil service test, we should 
have in mind the extent to which the test serves its basic pur- 
pose, prediction of performance on the job. To establish incon- 
trovertibly that a test has validity for this purpose is of course 
difficult. If a candidate population is used, the failing candi-
dates do not get jobs, so that there is no information regarding
their work, and the distribution of job performance indices is 
thus curtailed to an indeterminate extent. Even more trouble- 
some is the notorious unreliability of service ratings commonly 
used to appraise success on the job. Made by a number of 
raters, with individual standards, on employees performing for 
varying periods on jobs that differ although grouped under one 
classification, service ratings are sensitive to many uncontrolled 
and uncontrollable factors. Too frequently clearly recogniz- 
able differences in job performance are obscured by a tendency 
to rate employees in the “above average” categories, with the 
result that distributions of ratings exhibit marked negative 
skewness. And this is only one of the frailties to which the 
rater is heir! Unfortunately it is almost impossible to state 
just how unreliable service ratings are, since conditions for a 
really crucial experiment in this area have not been met in prac- 
tice. Probably, however, .30 to .50 represents a safe estimate 
of the Pearson correlation coefficient between two completely 
independent sets of ratings, made by raters assumed equally 
familiar with the work of employees on jobs within the same 
class, using rating forms developed with an ordinary amount of 
skill and care. A similar correlation of ratings made specifi- 
cally as a criterion against which to judge a test would tend 
to be higher. In any case, problems of securing as a criterion 
reliable evaluations of job performance on a population of sufficient
size are very great and in most instances insurmountable.

Establishment of test validity for a particular group of can- 
didates is troublesome enough; but efforts to prevalidate on one 
group a test to be used on another group for selective purposes 
are more hazardous still. 

If a population of employees rather than candidates is used, 
the problem of proportionate representation of those who would 
fail the test if they were candidates remains. Assurance is 
lacking that an employee group, for purposes of a validity ex- 
periment, is representative even of the passers among a candi- 
date group. Parts of a test that are satisfactorily valid for 
a candidate group may not differentiate among employees, who 
have had experience on the job and who may all have learned 
to do certain kinds of tasks that only the abler of the candi- 
dates could perform. Moreover, if the confidential nature of 
tests is to be maintained, an agency may not wish to be in the 
position of repeating, either wholly or in part, a test from which 
there may have been “leakage.” 

Attempts to secure an experimental group of subjects who 
are neither candidates nor employees face these difficulties as 
well as the impossibility of obtaining any measure of job per- 
formance. 

Confronted with these obstacles, the experimenter in the 
field of civil service tests usually resorts to a candidate popula- 
tion and an internal criterion, which is customarily the total 
score on a test composed of items sampling a variety of areas 
of knowledge and abilities. Conclusions should be drawn only 
charily from analyses based on such a multifactored criterion. 
Since different areas are represented in the variance of the total 
scores to differing and usually unknown extents, dependent 
upon the variances and interrelations of the part scores, gen- 
eralizations on relative validities are apt to be inappropriate 
and misleading. The item-test coefficient must be interpreted 
in relation to the item difficulty for the population in question. 
An item that fails to differentiate among candidates for one type 
of position may discriminate positively for another. Any items 
with negative coefficients should be carefully scrutinized with a 
view to improving the item if its re-use is contemplated and, in 
any event, to gaining insight into the characteristics that may
lead to negative coefficients. An item that correlates nega- 
tively with an internal criterion may, however, be positively 
related to the ultimate criterion to be predicted. 

For multiple-choice items, it is advantageous to examine 
not just the item-test coefficient, which may be regarded as a 
kind of summary statistic, but also the relationship of each 
option to the criterion. This has been approached in various 
ways—by correlating each choice with the criterion, by finding 
the mean criterion score of those who select each choice, by 
finding the proportion of persons in each of two contrasted 
criterion groups for each choice, or by some other method, de- 
pending upon the type of item index preferred. With access to 
this kind of information for each choice, the examiner can often 
improve items and learn how to construct better ones. Within 
certain limits, which of the many variant forms of item-analysis 
techniques one uses seems relatively unimportant. Ideally, it 
may be preferable to compute for each choice for each item a 
correlation coefficient (either a biserial, a point biserial, or 
both) with a multivariate criterion.¹ If interest centers in
only a small number of items, this approach is entirely prac- 
ticable. If, however, results on several hundreds or even 
thousands of items are available, a degree of statistical refine- 
ment may profitably be sacrificed in the interests of using a 
larger proportion of data at hand, particularly if facilities for 
such research are limited. In this case, a reasonably satisfac-
tory technique may be to use the tetrachoric correlation coeffi-
cient with the criterion dichotomized at the median. Here,
again, it will be of value to find in addition the proportions of 
candidates above and below the criterion median who select 
each choice. For most purposes, a practical index of item dif- 
ficulty is simply the percentage of candidates who select the 
“best” answer. 
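
The option-by-option indices described above can be sketched in a few lines. The example below is a minimal illustration, not a prescribed procedure: the function names, the ten-candidate data set, and the use of the total score as the internal criterion are assumptions made here. It computes, for each choice, the point-biserial coefficient with the criterion, the mean criterion score of those selecting it, and the proportions selecting it above and at or below the criterion median, together with the percentage choosing the “best” answer.

```python
import math


def point_biserial(flags, criterion):
    """Point-biserial r between a 0/1 flag and a continuous criterion."""
    n = len(flags)
    p = sum(flags) / n
    mean_all = sum(criterion) / n
    sd_all = math.sqrt(sum((c - mean_all) ** 2 for c in criterion) / n)
    if p in (0.0, 1.0) or sd_all == 0.0:
        return float("nan")  # undefined without variation in flag or criterion
    mean_sel = sum(c for f, c in zip(flags, criterion) if f) / sum(flags)
    return (mean_sel - mean_all) / sd_all * math.sqrt(p / (1.0 - p))


def analyze_item(choices, criterion, key):
    """Summarize one multiple-choice item, option by option.

    choices   -- option chosen by each candidate, e.g. "A".."D"
    criterion -- criterion score for each candidate (here, total test score)
    key       -- the option intended as the "best" answer
    """
    ordered = sorted(criterion)
    mid = len(ordered) // 2
    median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    upper = [c > median for c in criterion]  # candidates at the median count as lower
    n_upper = sum(upper)
    n_lower = len(criterion) - n_upper
    report = {}
    for option in sorted(set(choices)):
        flags = [1 if ch == option else 0 for ch in choices]
        picked = [c for f, c in zip(flags, criterion) if f]
        report[option] = {
            "r_pb": point_biserial(flags, criterion),
            "mean_criterion": sum(picked) / len(picked),
            "prop_upper": sum(f for f, u in zip(flags, upper) if u) / n_upper,
            "prop_lower": sum(f for f, u in zip(flags, upper) if not u) / n_lower,
        }
    # Item difficulty: percentage of candidates selecting the "best" answer.
    difficulty = 100.0 * sum(ch == key for ch in choices) / len(choices)
    return report, difficulty


# Hypothetical responses of ten candidates to one item keyed "A".
choices = ["A", "A", "B", "A", "C", "B", "A", "D", "A", "C"]
criterion = [48, 45, 30, 44, 28, 33, 41, 25, 47, 31]
report, difficulty = analyze_item(choices, criterion, key="A")
for option, stats in report.items():
    print(option, {name: round(value, 3) for name, value in stats.items()})
print("difficulty (% choosing best answer):", difficulty)
```

A distracter that draws a positive coefficient, or that is chosen as often above the median as below it, is the kind of choice the examiner would want to re-examine.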

¹ Since normally it is not intended to repeat any large group of items from a
given test, there is no advantage, and some disadvantage, in the application of any
of the “build-up” methods, such as those of Toops, Horst, and Richardson, in ana-
lyzing civil service tests. In any case, these methods are designed for use with an
external criterion, although the first two may be modified for an internal criterion.

Some agencies that do not at the present time want to re-
peat items nevertheless apply item analysis techniques with a
view to gaining insight into the characteristics of good and
poor items. Both subject-matter consultants and test techni-
cians will profit from reviewing results of statistical analyses,
even though the values may be in large part indirect.

As data accumulate over a period of years, it should be 
feasible, at least for classes attracting large numbers of candi- 
dates, to construct an examination largely from pretested items 
only a small number of which have been used before in any 
one examination. The results of statistical item analysis 
should not be applied blindly, for both the reliability and valid- 
ity of items may be altered strikingly by changes in the social 
milieu. A test item valid at one time may be based on a con- 
cept that later becomes such common knowledge that even the 
poorest candidate answers it correctly. Or it may hinge on a 
concept that becomes outmoded. For such reasons as these, 
even were statistical indices available for a group of items from 
so large a number of different tests that the confidential nature 
of examinations would not be jeopardized by judicious repeti- 
tion, competent consultants’ review of items shortly before their 
inclusion in a test would still be essential to sound test con- 
struction. 

Perhaps in the not too distant future factor analyses of civil 
service tests can dispel the problem created by the unknown 
contributions of several factors to the variance of internal cri- 
terion scores. Then items can be correlated with separate fac- 
tors instead of with a hodgepodge. Even so, the question of the 
appropriate weight for each factor in the composite could not 
be answered rigorously. The solution to this problem awaits 
development of a satisfactory external criterion. 

Establishment of Critical Scores. As those experienced in 
test construction know, even when statistical data are available 
the difficulty of an item cannot usually be predicted with high 
accuracy unless one has worked a great deal with the particular 
type of item and is predicting the difficulty for a group of can- 
didates having a known level of ability on the type of item in 
question. Although the prediction for each item may not be 
accurate, many civil service agencies assemble groups of items 
with a view to setting the passing point on an a priori basis.
In some cases, the law or rules under which the agency operates 
make such a judgment mandatory. In many others, where a 
fixed passing point is not so specified, the agency nevertheless 
considers the predetermination of passing points desirable, pri- 
marily from the point of view of public acceptance of the 70% 
passing point via the educational system. Truce with this 
concept, hoary with tradition though it is, is not entirely with- 
out compensation. 

Although the judgment of the appropriate difficulty of a 
test is admittedly hard to make, adherence to a fixed percentage 
of the total number of items as the passing point, where its use 
is not clearly inappropriate to needs of operating agencies or 
unjust to candidates, puts the examining agency in a position 
to combat pressures for setting passing points so that particular 
candidates are passed. Thus the agency is better able to be 
impartial. Not infrequently, however, even in agencies that 
have a fixed percentage passing point set by law, scores are 
transmuted if an examination appears to have been so difficult 
that a register resulting from it would contain insufficient 
names to fill vacancies. Usually a linear transformation is ap- 
plied so that some percentage score less than 70 becomes a 
derived score of 70 and, obviously, so that some other condition 
is satisfied, such as the original top score being equated to a 
score of, say, 95 on the derived scale. The most appropriate 
second condition varies with the character of the original dis- 
tribution and with the desired properties of the distribution 
of derived scores. 
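
The linear transformation just described is fixed by its two conditions. The sketch below assumes, purely for illustration, that a percentage score of 58 is to become the derived passing score of 70 and that the top percentage score of 88 is to become 95; the function name and these figures are hypothetical.

```python
def linear_transmutation(raw_pass, raw_top, derived_pass=70.0, derived_top=95.0):
    """Return the linear function mapping raw_pass -> derived_pass and raw_top -> derived_top."""
    if raw_top == raw_pass:
        raise ValueError("the two anchor scores must differ")
    slope = (derived_top - derived_pass) / (raw_top - raw_pass)
    intercept = derived_pass - slope * raw_pass
    return lambda raw: slope * raw + intercept


# Hypothetical anchors: raw 58 becomes 70, raw 88 becomes 95.
transmute = linear_transmutation(raw_pass=58.0, raw_top=88.0)
for raw in (50.0, 58.0, 70.0, 88.0):
    print(f"raw {raw:5.1f} -> derived {transmute(raw):6.2f}")
```

Because the transformation is linear with a positive slope, the rank order of candidates is unchanged; only the scale on which the passing point is expressed moves.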

Although it may be considered advantageous to attempt to 
estimate test difficulty so accurately that no transformation 
of scores will be necessary, transmuting upward is considered 
preferable to transmuting downward, because candidates, while 
welcoming scores seemingly higher than actually attained, are 
reluctant to accept scores lower than the percentages of items 
answered correctly. Moreover, a test that is difficult will dis- 
criminate among candidates better than one that is too easy. 

Several factors in addition to the inherent complexities of 
estimating test difficulty magnify the problem for the civil 
service examiner. One is that the nature of the candidate
population varies with the labor market; and the extent of 
change, as it pertains to the abilities and knowledges being 
measured, is not easy to forecast. It varies with recruitment, 
again unpredictably. It varies with changes in minimum 
qualifications and salary level, in such a way as to be especially 
bothersome if minimum qualifications are lowered at the same 
time that the salary level is increased, a combination not un- 
usual in a tight labor market. In fact, with the multiplicity of 
factors operative, it is surprising that results as satisfactory 
as those commonly achieved are possible. 

Sometimes effort is made to set the critical score at a 
“break” in the distribution, on the grounds that failing candi- 
dates will accept their lot more readily. Some evidence to 
bolster this point of view probably could be adduced. At 
times, however, the procedure seems to be based on the assump- 
tion that peculiar significance attaches to a segment of the 
range within which no scores fall. This is nonsense. It is 
doubtless possible to construct a test having scores with a 
bimodal distribution. Whatever gaps appear in distributions 
of civil service test scores, however, are due to chance, not to 
rational design. If the number of candidates is sizeable, there 
will be no appreciable breaks in the proximity of any reasonable 
critical score; and no matter where this score is set there will be 
scores just below it. All things considered, an agency should 
face this fact tough-mindedly and not waste time in seeking 
breaks in score distributions or in creating them artificially. 

Compiling Related Examinations. A task almost peculiar 
to civil service examination construction is that of assembling 
for a single administration examinations for sometimes as many 
as 35 or 40 classes, but perhaps more typically for from five to 
15. If there are no common requirements of knowledges or 
abilities among classes for which examinations are to be held, 
then the examinations are simply assembled independently. 
If there are some requirements common to two or more classes 
the examinations ordinarily include some common items sam- 
pling the overlapping ones, particularly identical degrees of a 
given kind of knowledge or ability. 

One argument for using overlapping items for related classes
is that some candidates may want to take more than one 
examination; if the extent of overlapping is slight, such candi- 
dates may not complete some of the examinations in the time 
allowed. To permit candidates to take several examinations 
for related classes in a single day (usually the maximum time 
for administration of an examination program), the only alter- 
native to overlapping is shortening the total length of each 
examination, which is of course disadvantageous from the point 
of view of reliability. Some jurisdictions err in the other di- 
rection by constructing tests with so large a proportion of over- 
lap that differentiation among the tests is unreliable. The 
effect of this error is that candidates who fail one examination 
are likely to pass another for a higher class in the same series, 
because of the chance element in the small number of differ- 
entiating items. 

Such a result is not the reductio ad absurdum of the ex- 
amination process that it may seem at first glance. As a mat- 
ter of fact, identical examinations for classes that differ in 
degree rather than in kind would in some instances be ap- 
propriate were it not for interpreting to the public several 
passing point standards. On the other hand, the same passing 
points might be set for the written component of the examina- 
tions for several classes in a hierarchy, if the total examinations 
for the classes were differentiated on some other basis. Such 
a practice has sometimes been followed by the United States 
Civil Service Commission. For jurisdictions more limited in 
size of the public served, different written examinations for 
separate but related classes seem preferable, because some of 
the candidates for each are likely to be placed in the same 
agency on the basis of the examinations. 

Roughly stated, the most useful principle for differentiating 
examinations for classes in a series is that the examinations 
should contain enough different items to reflect reliably the dis- 
similarities in requirements and enough common items to en- 
able candidates who so desire to take three or four examinations 
at one sitting. A further important advantage in the use of 
common items is the economy effected in item construction and 
review. Although there is no rule applicable to all cases, probably
the minimum number of differentiating items for two
examinations for closely related classes, each consisting of 200 
items, should be in the neighborhood of 50 to 70 items. The 
optimum number differs considerably from one pair of classes 
to another. 

Order of Items for Related Examinations. A special prob- 
lem of ordering items arises when, say, 15 examinations are 
encompassed in a booklet of, say, 1000 items, many of which are 
common to two or more classes. If they are ordered strictly 
according to subject matter, the items in any one examination 
are scattered; hence candidates who have to skip several groups 
complain that the mechanics are confusing and may waste their 
time. On the other hand, if the items are ordered in such a 
way as to minimize the number of “breaks” for the examina- 
tions having the largest numbers of candidates, the items can 
no longer be ordered entirely by subject matter, nor can an 
ordering from easier to more difficult be strictly maintained. 

If administrative controls are adequate, perhaps the best 
solution is to assemble for each candidate a booklet containing 
only the items pertaining to the examination or group of ex- 
aminations he is taking. Such a plan creates limitless con- 
fusion if the program is improperly administered. 

A solution that places a lesser burden on sound administra- 
tive controls is to retain the plan of assembling related ex- 
aminations in one booklet and to compromise among the 
objectives of arranging items (1) in the order that minimizes 
breaks in each examination, (2) in the order that seems logical 
so far as subject matter is concerned, and (3) in order of dif- 
ficulty. Perhaps the simplest way to arrive at an ordering 
reasonably satisfactory from these three points of view is first 
to establish an order that minimizes breaks for the most “pop- 
ulous” classes, then to adjust the ordering by subject matter, 
then further to rearrange the items so that for any single class 
the more difficult items in any distinguishable area follow the 
easier items. 
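
A first pass at such a compromise ordering can be sketched as a priority sort. Everything in the example is an assumption made for illustration: the class names, the candidate counts, the item records, and the rule itself, which groups items by their membership in the classes taken in order of size, then by subject, then by difficulty. It keeps the most populous class free of breaks but does not claim to minimize breaks for the others.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: int
    subject: str
    difficulty: float    # higher means harder (hypothetical index)
    classes: frozenset   # classes whose examinations include this item


def count_breaks(order, cls):
    """Number of times a candidate for `cls` must skip over items not in his examination."""
    positions = [i for i, item in enumerate(order) if cls in item.classes]
    return sum(1 for a, b in zip(positions, positions[1:]) if b - a > 1)


def order_items(items, class_sizes):
    """Sort by membership in the most populous classes, then subject, then difficulty."""
    by_size = sorted(class_sizes, key=class_sizes.get, reverse=True)

    def key(item):
        # False sorts before True, so items belonging to larger classes come first.
        membership = tuple(cls not in item.classes for cls in by_size)
        return membership + (item.subject, item.difficulty)

    return sorted(items, key=key)


# Hypothetical classes and items.
class_sizes = {"Clerk I": 400, "Clerk II": 150, "Typist": 60}
items = [
    Item(1, "arithmetic", 0.3, frozenset({"Clerk I", "Clerk II"})),
    Item(2, "filing", 0.2, frozenset({"Clerk I"})),
    Item(3, "typing", 0.5, frozenset({"Typist"})),
    Item(4, "arithmetic", 0.6, frozenset({"Clerk II"})),
    Item(5, "filing", 0.4, frozenset({"Clerk I", "Typist"})),
    Item(6, "arithmetic", 0.1, frozenset({"Clerk I", "Clerk II"})),
]
ordered = order_items(items, class_sizes)
print("order:", [item.item_id for item in ordered])
for cls in class_sizes:
    print(cls, "breaks:", count_breaks(ordered, cls))
```

Because membership takes priority in this sketch, the easy-to-hard sequence holds only within each membership group; the final rearrangement by difficulty within a class would still call for the examiner's judgment.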

Selection and Training of Subject-Matter Consultants. 
Earlier emphasis was given to subject-matter consultation in 
civil service test construction. Brief mention will now be made
of some of the problems of using consultants in this capacity. 
Both selecting and training consultants are simplified for those 
examining agencies needing full-time consultants and having 
money to pay for them. More typically agencies seek inten- 
sive but intermittent services. 

Unfortunately those who are recognized as authorities in 
their fields are also those who are likely to be employed in posi- 
tions from which release for temporary assignment elsewhere 
is not easy to obtain. On the other hand, even if an examining 
agency has a continuing job of such magnitude as to warrant a 
permanent assignment of a consultant, many of those who 
would be acceptable as consultants are loath to abandon front-
line operations to undertake the task of predicting the per-
formance of others. They hesitate to leave a position that they
know and like to pioneer in a job that they may find difficult 
to understand and that may offer no clear line of advancement. 
Having taken this hurdle, they become dissatisfied if oppor- 
tunities for contacts with others in their field are too limited 
or if for some other reason they “go stale” on the job. Because 
of these factors, some agencies tend to prefer making a series 
of temporary appointments, possibly on a part-time basis. 

Probably no single solution to the problem of selecting and 
training subject-matter consultants would meet all needs. 
Where it is appropriate and feasible, however, a plan of having 
a senior consultant on a full-time and permanent basis and ad- 
ditional consultants on a part-time or temporary basis has the 
advantage of providing continuity to the examination program 
while at the same time securing a reflection of different points 
of view in the examination content. This plan also simplifies 
the training problem, because a permanent subject-matter con- 
sultant who understands examination construction can inter- 
pret aspects of this field to other consultants in the same area, 
often more efficiently than can the psychometrician. 

Some persons who are sought as consultants and who may 
be recognized as authorities simply cannot be taught to con- 
struct examinations. Sometimes the difficulty is apparently 
attributable to temperament, sometimes to pattern of abilities, 
more often to the combined influence of both factors. Deciding
that a consultant cannot be taught to construct items may
require only half an hour; for one who approaches
the task with interest and enthusiasm, it may take a month or
more. Such ineptitude is not peculiar to any field. It can be
noted in psychology, art, grammar, accounting, social work, 
law, music, and statistics. The fruitful solution to this problem 
is to get another consultant as tactfully as possible. 

Once a consultant is found who is both interested in and 
adept at item construction and test compilation, he should be 
given interpretation on the desirable length of an examination, 
scoring procedures, time limits, reliability and validity, trans- 
mutation of scores, weighting of several components, and simi- 
lar concepts. Two of the most common misconceptions of 
subject-matter consultants are confusion between the difficulty 
and the validity of an item and failure to differentiate between 
a test as a predictive instrument and as a teaching device. 
Much attention must be given to overcoming these and similar 
fallacies related to examinations. 

Magnitude of the Task. The final problem of which brief 
mention will be made is the scope of the examining job to be 
done within budgetary limitations. Although monies allocated 
to the examining function in any area are rarely of staggering 
proportions, probably the public personnel agencies suffer the 
most from paucity of funds. The number and variety of jobs 
for which examinations are to be constructed and the sizes of 
the populations to be examined in normal times are in many in- 
stances almost unbelievable. In a single jurisdiction hundreds 
of lives may be affected by a single examination program. Yet 
the annual budget for the examining function probably would 
represent only a small fraction of the sum allocated yearly to 
the “supervision of reindeer in Alaska.” Despite the handicaps 
of inadequate staff and limited facilities for research, the notable
progress of the last decade places on those in the field of
measurement an obligation for continued effort toward the
solution of the critical problems that remain.































RELATIONSHIP BETWEEN KUHLMANN-ANDERSON
INTELLIGENCE TESTS IN GRADE 1 AND ACADEMIC
ACHIEVEMENT IN GRADES 3 AND 4*


MILDRED M. ALLEN 
New Rochelle Public Schools, New Rochelle, New York 


This study is part of a doctoral research project on the prediction
of academic success of elementary school pupils by means of
the Kuhlmann-Anderson Intelligence Tests. Its purpose was to
determine the value of the Kuhlmann-Anderson Intelligence Tests
as a whole, when administered in Grade 1, for predicting
achievement in reading, arithmetic, and spelling in Grades 3
and 4 as measured by the New Stanford Achievement Test.


Data 


The subjects for this study were three hundred and twenty- 
seven pupils from ten elementary schools in New Rochelle, 
New York. Complete test results of these pupils from the 
school years 1936-37 to 1939-40 were used. The tests were 
the Kuhlmann-Anderson Intelligence Test and the New Stan- 
ford Achievement Test. 


Procedure 


An alphabetical class list of fourth-grade pupils (1939-40) 
by schools was obtained from the school census clerk. From 
this list pupils were selected who were originally in the first 
grade in 1936-37. A checking and re-checking of all test re- 
sults dating from Grade 1, 1936-37, and including the fourth 
grade of 1939-40 was made in order to select a group of pupils 
who had taken the complete battery of tests as used in the 
present study. 





* Part of a study for a Doctor’s dissertation completed at New York University, 
Graduate School of Education, 1940. 




The Kuhlmann-Anderson Intelligence Test for Grade 1 
(Second Semester), was administered in February, 1937, when 
the pupils were midway through the first grade. The New 
Stanford Achievement Test (Primary Examination) Form Z, 
was administered in April, 1939, near the close of the third 
grade, and the Advanced Examination, Form W, of the same 
test was administered to the same pupils in the fourth grade 
in October, 1939. 

The Kuhlmann-Anderson Intelligence Test was personally 
administered, scored, and re-scored by the writer. The New 
Stanford Achievement Test (Primary Examination) was ad- 
ministered in the elementary schools by the principal or a 
teacher who had test experience and training. Both principal 
and teacher-examiner received instructions for the administra- 
tion, scoring, and tabulation of test results from the writer. 
The tests were scored and double-checked by specified teachers
in the respective schools, by the principal, or by the
teacher-examiner. The New Stanford Achievement Test (Advanced
Examination) was administered by the writer to all fourth 
grades in October, 1939. This test was scored and re-scored 
by an assistant under the direct supervision of the writer. 


Results and Interpretations 


The mental ability of the pupils used in this study is shown
in Table 1. Correlation coefficients between measures obtained
from the Kuhlmann-Anderson Intelligence Test in
Grade 1 and performance on the New Stanford Achievement
Tests in Grades 3 and 4 are shown in Tables 2 and 3.


TABLE 1

Means, Sigmas, and Ranges of I.Q.’s on the Kuhlmann-Anderson Intelligence Test
in Grade 1 and in Grade 4

Test                                            Mean I.Q.     σ       Range
Kuhlmann-Anderson Test, Grade 1, Feb., 1937       100.7       9.9     63-125
Kuhlmann-Anderson Test, Grade 4, Oct., 1939        99.8      11.9     67-132





The mean I.Q.’s of the pupils indicate average ability for the
group as a whole, with the range of I.Q.’s showing the wide
scatter of ability commonly found in heterogeneous grouping.
The intelligence level of the pupils remains about the same in
Grade 4. (These were the same pupils tested in Grade 1.)


TABLE 2

Coefficients of Correlation Between Kuhlmann-Anderson Intelligence Test Performance
in Grade 1 and the New Stanford Achievement Test (Primary
Examination, Form Z) Performance in Grade 3

New Stanford Achievement Test,      Kuhlmann-Anderson Test, Grade 1
Grade 3                             M.A.          I.Q.    Pc.Av.*
[label illegible]                   .40           .41     .43
[label illegible]                   .32           .39     .40
[label illegible]                   [illegible]   .42     .42
Arithmetic Reasoning                .48           .45     .45
Arithmetic Computation              .45           .43     .43
Arithmetic Average                  .52           .50     .49
[label illegible]                   .36           .40     .42
[label illegible]                   .46           .49     .49
Educational Age                     .48           .53     .53
Educational Quotient                .40           .67     .67





* The Pc.Av., or Per Cent of Average Development, is an index obtained by
dividing an individual’s mental unit points by the average mental unit points for his
age group, mental units being determined by conversion to a point scale designed by
Heinis. The Pc.Av. is preferred by Kuhlmann to the I.Q., since it is more constant
for retests over a period of years.

Table 2 reveals coefficients of correlation between the Kuhlmann-Anderson
measures and educational achievement ranging
from .32 to .53, with twenty of the thirty coefficients between
.40 and .50. In view of the fact that these coefficients
have standard errors of from .03 to .05, no significant differences
among the coefficients are indicated. The I.Q., Pc.Av. (Per
Cent of Average Development), and E.Q. all include one common
element, namely, the chronological age (C.A.), which is
not included in the mental age (M.A.) score. It may be noted
that some persons regard correlations of ratios involving the
same variable denominators as spurious.¹ The critical ratio
for the difference between the coefficients of .53 (I.Q. or Pc.Av., and
E.A.) and .67 (I.Q. or Pc.Av., and E.Q.) is 3.1.
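
The standard errors of .03 to .05 quoted in this paragraph are consistent with the classical large-sample formula, standard error of r = (1 - r²)/√N, applied to the 327 pupils of the study. The source does not say which formula was used, so the following Python sketch is an assumption offered only as a check on the arithmetic; the function name is illustrative.

```python
import math


def standard_error_of_r(r: float, n: int) -> float:
    """Classical large-sample standard error of a Pearson r: (1 - r^2) / sqrt(n)."""
    return (1.0 - r * r) / math.sqrt(n)


N_PUPILS = 327  # number of pupils reported in the Data section

# Lowest and highest coefficients quoted for Table 2, plus the E.Q. coefficient.
for r in (0.32, 0.53, 0.67):
    se = standard_error_of_r(r, N_PUPILS)
    print(f"r = {r:.2f}  standard error = {se:.3f}")
```

Under this assumption the lowest coefficient (.32) has a standard error of about .05 and the E.Q. coefficient (.67) about .03, which matches the range given in the text.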

Whether one considers the M.A., I.Q., or Pc.Av. seems to
make little difference, since all yield substantially the same
coefficients with the various subtest results on the Stanford Primary
Achievement Test. One part of the Stanford Primary
Achievement Test is predicted as well from the various
Kuhlmann-Anderson measures as any other part. The only variable
derived from the Stanford Achievement Test which shows any
significant differences in the coefficients is the E.Q. An examination
of the last row of Table 2 shows that the correlation
between M.A. and E.Q. is about the same as the other coefficients
in the table. The relationships between E.Q. and I.Q.,
and between E.Q. and Pc.Av., are substantially higher (.67 and
.67, compared with .40). The critical ratio of the difference
between the correlation of M.A. and E.Q. and the correlation
of I.Q. and E.Q. is 5.4. This indicates undeniable
statistical significance of the difference.

¹ J. P. Guilford, Psychometric Methods, p. 374.
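
The concern about ratios that share a denominator can be illustrated with a small simulation. The Python sketch below is not based on the study's data: it draws three mutually independent hypothetical variables (stand-ins for M.A., E.A., and C.A., all with assumed means and spreads), forms I.Q.-like and E.Q.-like ratios over the common C.A., and compares the two correlations. With equal relative variability in all three variables, the ratios correlate at roughly .5 even though the numerators are uncorrelated.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
n = 5000

# Hypothetical, mutually independent "ages" in months (assumed values).
mental_age = [rng.gauss(100, 10) for _ in range(n)]
educational_age = [rng.gauss(100, 10) for _ in range(n)]
chronological_age = [rng.gauss(100, 10) for _ in range(n)]

# Quotients that share the same variable denominator.
iq_like = [100 * m / c for m, c in zip(mental_age, chronological_age)]
eq_like = [100 * e / c for e, c in zip(educational_age, chronological_age)]

r_raw = pearson_r(mental_age, educational_age)
r_ratio = pearson_r(iq_like, eq_like)
print(f"r between the raw ages:  {r_raw:.2f}")
print(f"r between the quotients: {r_ratio:.2f}")
```

In real pupils M.A., E.A., and C.A. are themselves correlated, so the simulation shows only the direction of the effect, not its size in this study.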

Excluding the coefficients involving E.Q., the median coefficient
for Table 2 is .44. While this indicates some degree of
relationship between the Kuhlmann-Anderson scores and educational
achievement, it is by no means great enough to be of
much value in predicting third-grade performance from the
Kuhlmann-Anderson Intelligence Tests administered in Grade
1. Because of differences between the non-verbal content of the
earlier Kuhlmann-Anderson Test in Grade 1 and the verbal
content of the Stanford Primary Achievement Test in Grade 3,
it appears that long-range predictions of academic achievement
would not be reliable. To insure a greater degree of reliability,
intelligence tests should be repeated annually, and preferably
within the grade for which predictions are made. The coefficient
of alienation corresponding to an r of .44 is .8980, an indication
that errors of prediction would be reduced by only
10.2 per cent by the use of the Kuhlmann-Anderson Intelligence
Test in Grade 1, instead of making predictions without
the tests. The highest coefficient of Table 2, .53, between I.Q.
(or Pc.Av.) and E.A., yields a k (coefficient of alienation)² of
.8542, indicating a reduction in errors of prediction of only 14.68
per cent better than chance.

² J. P. Guilford, Psychometric Methods, p. 362.

The most predictable measure obtained from the New Stanford
Achievement Test, Primary Examination, when predictions
are made from the Kuhlmann-Anderson Intelligence
Test administered in Grade 1, seems to be the Educational
Quotient (E.Q.). The r of .67 between I.Q. (or Pc.Av.) and
E.Q. yields a k (coefficient of alienation) of .7424 and indicates
a reduction in error of prediction of 25.76 per cent better than
chance. The E.Q., however, does not give any indication of the
actual level of achievement, but only of achievement compared
with C.A. It is significant, however, as an index of educational
brightness. In this study the Pc.Av. (Per Cent of Average
Development) does not appear to have any advantage (for
predictive purposes) over the I.Q. or M.A. scores obtained
from the same test.
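
The coefficients of alienation quoted in this section follow from k = √(1 - r²), with 1 - k read as the proportional reduction in errors of prediction. The Python sketch below recomputes them; the function names are illustrative and not from the source. For r = .44 and r = .67 it reproduces the printed .8980 and .7424. For r = .53 the same formula gives about .848 rather than the printed .8542, a value that corresponds more nearly to an r of .52.

```python
import math


def coefficient_of_alienation(r: float) -> float:
    """k = sqrt(1 - r^2): the share of prediction error remaining at correlation r."""
    return math.sqrt(1.0 - r * r)


def error_reduction_percent(r: float) -> float:
    """Percentage by which errors of prediction are reduced relative to chance."""
    return 100.0 * (1.0 - coefficient_of_alienation(r))


# Median coefficient, highest coefficient, and E.Q. coefficient from Table 2.
for r in (0.44, 0.53, 0.67):
    k = coefficient_of_alienation(r)
    gain = error_reduction_percent(r)
    print(f"r = {r:.2f}  k = {k:.4f}  reduction = {gain:.2f} per cent")
```

The calculation shows why coefficients in the .40s are of limited predictive use: even an r of .67 removes only about a quarter of the error of prediction.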

TABLE 3

Coefficients of Correlation Between Kuhlmann-Anderson Intelligence Test Performance
in Grade 1 and New Stanford Achievement Test (Advanced
Examination, Form W) Performance in Grade 4

Stanford Achievement Test,          Kuhlmann-Anderson Test, Grade 1
Grade 4                             M.A.    I.Q.    Pc.Av.
[label illegible]                   .37     .41     .42
[label illegible]                   .30     .36     [illegible]
[label illegible]                   .35     .42     .41
Arithmetic Reasoning                .52     .51     .50
Arithmetic Computation              .49     .46     .49
Arithmetic Average                  .56     .53     .53
[label illegible]                   .36     .39     .42
[label illegible]                   .49     .52     [illegible]
[label illegible]                   .48     .51     .52
Educational Quotient                .43     .67     .67





Table 3 is very similar to Table 2, except that in this case
the relationships shown are those between the Kuhlmann-Anderson
Intelligence Test for Grade 1 and achievement measured
at the beginning of Grade 4. The Advanced Examination,
Form W, of the New Stanford Achievement Test was used
in this instance. The intelligence test scores are the same, but
the achievement test scores were obtained approximately six
months later (after a summer vacation) and were obtained
from a somewhat more advanced examination.

None of the coefficients in Table 3 is significantly different
from the corresponding coefficient of Table 2. The generalizations
regarding Table 3 are the same as those drawn from
Table 2. Excluding those coefficients involving E.Q., the
range of the coefficients of correlation is from .30 to .56, with a
median r of .48, which is higher by only one standard error of r
(.04) than the median r of .44 in Table 2. In this instance it is
again evident that the E.Q. is the most predictable measure of
educational achievement. The reduction in errors of prediction due to the
magnitude of r is practically identical with the corresponding
reductions in Grade 3. The M.A., I.Q., and Pc.Av. are about
equally effective for prediction. There seems to be no superiority
of the Pc.Av. over I.Q. as a predictive measure.

A study of Table 3 shows a higher correlation between all 
indices of intelligence (M.A., I.Q., and Pc.Av.) and arithmetic 
reasoning, than between the same indices of intelligence and any 
part of the reading tests. The correlation between the indices 
of intelligence and arithmetic average is higher than the corre- 
lations between the same indices of intelligence and reading 
average. The most noticeable shift in prediction is that of the 
arithmetic scores. 


Summary 


The most predictable measure obtained from the New Stan- 
ford Achievement Test, Primary Examination (Grade 3) and 
from the Advanced Examination (Grade 4) when predictions 
are made from the Kuhlmann-Anderson Intelligence Test for 
Grade 1, seems to be the E.Q. (r= .67). Since E.Q. does not 
give any indication of the actual level of achievement, but only 
of achievement compared with C.A., this is significant only as 
an index of educational brightness. The Pc.Av. in this instance 
does not seem to have any advantage (for predictive purposes)
over the I.Q. or M.A. scores obtained from the Kuhlmann-Anderson
Intelligence Test. All three indices of intelligence
(M.A., I.Q., and Pc.Av.) are about equally effective for prediction.

Coefficients of correlation between the Kuhlmann-Anderson
measures in Grade 1 and educational achievement in Grade 3
range from .32 to .53, and between the same test (Kuhlmann-Anderson
Intelligence Test, Grade 1) and educational achievement
in Grade 4 from .30 to .56. These low correlations indicate
that long-range predictions of educational achievement
based on only one group intelligence test in the first grade are
highly questionable.

REFERENCES 


1. Brown, M. E. “Measuring Mental Ability in the Intermediate
Grades of the Elementary School.” School and Society,
XXXV (1932), 323-324.

2. Brown, A. W. and Lind, C. “School Achievement in Relation
to Mental Age.” Journal of Educational Psychology,
XXII (1931), 561-576.

3. Buzley, D. E. “A Study of Test Results at the Third and Fifth
Grade Levels.” Psychological Clinic, XX (1931), 1-29.

4. Cattell, P. “The Heinis Personal Constant as a Substitute for
the I.Q.” Journal of Educational Psychology, XXIV
(1933), 221-228.

5. Durrell, D. D. “The Influence of Reading Ability on Intelligence
Measures.” Journal of Educational Psychology,
XXIV (1933), 412-416.

6. Easley, H. “One of the Limits of Predicting Scholastic Success.”
Journal of Experimental Education, I (1933), 272-276.

7. English, H. B. “The Predictive Value of Intelligence Tests.”
School and Society, XXVI (1927), 783-799.

8. Erffmeyer, C. A. “Intelligence Tests as an Aid in the Diagnosis
of Academic Maladies.” School and Society, XX (1924),
307-320.

9. Gates, A. I. “The Correlations of Achievement in School Subjects
with Intelligence Tests.” Journal of Educational
Psychology, XIII (1922), 277-285.

10. Gates, A. I. “The Unreliability of M.A. and I.Q. Based on Group
Tests of General Mental Ability.” Journal of Applied
Psychology, VII (1923), 93-100.

11. Guilford, J. P. Psychometric Methods. New York: McGraw-Hill,
1936. Pp. xi + 566.

12. Hawthorne, J. W. “The Effect of Improvement in Reading
Ability on Intelligence Test Scores.” Journal of Educational
Psychology, XXVI (1935), 41-51.

13. Hilden, A. H. “A Comparative Study of the Intelligence Quotient
and the Heinis Personal Quotient.” Journal of Applied
Psychology, XVII (1933), 355-375.

14. Kelley, T. L., Ruch, G. M. and Terman, L. M. Guide for Interpreting
the New Stanford Achievement Test. Yonkers:
World Book, 1929. Pp. 1-16.

15. Klein, A. “Intelligence Compared with Achievement.” High
Points Bulletin, XII (1930), 3-5.

16. Kuhlmann, F. “The Kuhlmann-Anderson Intelligence Tests
Compared with Seven Others.” Journal of Applied Psychology,
XII (1928), 545-594.

17. Kuhlmann, F. and Anderson, R. Kuhlmann-Anderson Intelligence
Instruction Manual, IV. Philadelphia: Educational
Test Bureau, 1933. Pp. iv + 125.

18. Line, W. and Glen, J. S. “Some Relationships Between Intelligence
and Achievement in the Public School.” Journal
of Educational Research, XXVIII (1935), 582-599.

19. McCall, Wm. Measurement. New York: Macmillan, 1939.
Pp. xv + 535.

20. Mitchell, A. C. “Prognostic Value of Intelligence Tests.”
Journal of Educational Research, XXVIII (1935), 577-581.

21. Riley, G. L. “A Comparison of the Personal Constant and
Intelligence Quotient.” Psychological Clinic, XVIII
(1930), 26-65.

22. St. John, C. S. Educational Achievement in Relation to Intelligence.
Cambridge, Massachusetts: Harvard Press,
1930. Pp. xiv + 208.


























MEASUREMENT ABSTRACTS* 


Arthur, Grace. “A Non-Verbal Test of Logical Thinking.” Journal of Consulting 

Psychology, VIII (1944), 33-34. 

The author has formulated a non-verbal test of logical thinking, similar in 
purpose to the Kohs Block Design Test, but employing designs to be reproduced 
with plain and stenciled cards in various colors. The problems presented by the 
designs were of three kinds, form, color, and sequence. Tentative norms established 
on the basis of results obtained from 500 subjects indicate an increase in average 
score from one age group to the next. Catherine Anne McNally. 





Bergmann, Gustav and Spence, Kenneth W. “The Logic of Psychophysical Mea- 
surement.” Psychological Review, LI (1944), 1-24. 

A methodological analysis of some of the problems of psychophysical measure- 
ment and of other aspects of measurement in psychology is presented from the 
standpoint of scientific empiricism. After a discussion of the methodological frame 
of reference, in which the necessity for operational definitions from a physicalistic 
basis is stressed, and a review of certain principles of physical measurement, an 
analysis of psychophysical measurement is given. It is concluded that not only 
should the use of various terms applicable to physical measurement be discouraged 
in psychophysical measurement, but that measurement in psychophysics should be 
set up as a “technique in its own right.” Lorraine Bouthilet. 





Brown, Fred. “An Experimental Study of the Validity and Reliability of the Brown
Personality Inventory for Children.” Journal of Psychology, XVII (1944),

The Brown Personality Inventory for Children was administered to 77 clinically
diagnosed neurotic boys and 200 normal boys between the ages of 8 and 15 and in
Grades 4-9, inclusive, in order to determine whether the inventory would differentiate
reliably and consistently between maladjusted and normal children. Highly
significant differences between the two groups were found in each of the five categories
of the instrument. These data were supplemented by two additional experiments
with the personality inventory, which resulted in high test-retest correlations
and consistently high reliabilities for the stability of individual items. Catherine
Anne McNally.





Brown, Fred. “Comparative Study of the Intelligence of Jewish and Scandinavian
Kindergarten Children.” Journal of Genetic Psychology, LXIV (1944), 67-92.

Three hundred and twenty-three (131 males, 192 females) second-generation
Scandinavian and 324 (178 males, 146 females) second-generation Jewish kindergarten
children in the Minneapolis Public Schools were tested on the 1916 Revision
of the Stanford-Binet. Rigorous control of age, sex, and socio-economic status in
selection of the data was exercised. Comparisons made between the performance
of the two groups on general intelligence, basal age, and vocabulary show no significant
differences. However, there appears to be a difference when the various subtests
are considered. Variation among occupational levels is greater for each of the
groups studied than the variation within individual occupational levels, and the difference
that appears between the two groups decreases as one passes from lower to
upper occupational levels. Miriam D. Rotman.


* Edited by Forrest A. Kingsbury. 


Cattell, Raymond B. “An Objective Test of Character-Temperament: II.” Journal 

of Social Psychology, XIX (1944), 99-113. 

This study encompasses three distinct researches: the first dealing with a group 
of 60 school children, the second with 49 adult women, and the third with 40 
students. The objective of the study was to determine how far a personality trait 
recognizable in everyday life situations can be made to express itself also in a mini- 
ature laboratory situation, objectively scorable and requiring no great complexity 
of apparatus. The results indicate that the Character-Temperament Test so formu- 
lated has high consistency, is almost uncorrelated with intelligence, and shows no 
significant sex difference in mean performance. Catherine Anne McNally. 





Festinger, Leon. “A Statistical Test for Means of Samples from Skew Populations.”
Psychometrika, VIII (1943), 205-210.

This paper presents a test for determining significance of differences between
means of samples which are drawn from positively skewed populations, more specifically,
those having a Pearson Type III distribution function. The quantity $2np\bar{x}_s/\bar{x}_p$
(where p equals the mean squared divided by the variance and n is the number of
cases in the sample), which distributes itself as Chi Square for $2np$ degrees of
freedom, may be referred to the tables of Chi Square for testing hypotheses about
the value of the true mean. For two independent samples, the larger mean divided
by the smaller mean, which distributes itself as F for $2n_1p_1$ and $2n_2p_2$ degrees of
freedom, may be referred to the F distribution tables for testing significance of difference
between means. The test assumes that the range of possible scores is from
zero to infinity. When a lower theoretical score limit (c) exists which is not zero,
the quantity (Mean - c) should be used instead of the mean in all calculations.
(Courtesy Psychometrika.)
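
The two statistics described in this abstract can be sketched in Python as follows. Several points are assumptions rather than statements from the source: p is estimated from each sample as mean squared over sample variance, the two means in the chi-square quantity are read as the sample mean and the hypothesized population mean, and the chi-square degrees of freedom are taken as 2np, which is what a gamma (Pearson Type III) model implies. Function and parameter names are illustrative. The sketch returns only the statistic and its degrees of freedom, which would then be referred to chi-square or F tables.

```python
import statistics


def shape_estimate(scores):
    """Estimate p as mean squared divided by the sample variance (an assumption)."""
    mean = statistics.mean(scores)
    return mean * mean / statistics.variance(scores)


def chi_square_for_mean(sample, hypothesized_mean, lower_limit=0.0):
    """Return (statistic, degrees of freedom) for a test of the true mean."""
    scores = [x - lower_limit for x in sample]  # use (score - c) throughout
    n = len(scores)
    p = shape_estimate(scores)
    target = hypothesized_mean - lower_limit
    statistic = 2 * n * p * statistics.mean(scores) / target
    return statistic, 2 * n * p


def f_for_two_means(sample_a, sample_b, lower_limit=0.0):
    """Return (F, df numerator, df denominator): larger mean over smaller mean."""
    a = [x - lower_limit for x in sample_a]
    b = [x - lower_limit for x in sample_b]
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    df_a = 2 * len(a) * shape_estimate(a)
    df_b = 2 * len(b) * shape_estimate(b)
    if mean_a >= mean_b:
        return mean_a / mean_b, df_a, df_b
    return mean_b / mean_a, df_b, df_a


stat, df = chi_square_for_mean([1, 2, 3, 4, 5], hypothesized_mean=3.0)
f_value, df1, df2 = f_for_two_means([2, 4, 6, 8, 10], [1, 2, 3, 4, 5])
print(stat, df)
print(f_value, df1, df2)
```

The degrees of freedom are generally not whole numbers, so table lookups would need interpolation or rounding.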





Findley, Warren G. (Chairman of the Committee on Psychological Tests). “Psychological
Tests and Their Uses.” Review of Educational Research, XIV
(1944), 1-27.

This issue reviews the literature for the three years ending July, 1943. The
following titles are included:

Findley, Warren G. “Brief Overview of the Period.”
Cornell, Ethel L. “Current Construction and Evaluation of Intelligence Tests.”
Freeman, Frank S. “Applications of Intelligence Tests.”
Sells, Saul B. “Measurement and Prediction of Special Abilities.”
Traxler, Arthur E. “Current Construction and Evaluation of Personality and
Character Tests.”
Darley, John G. and Anderson, Gordon V. “Applications of Personality and
Character Measurement.”
Symonds, Percival M., Krugman, Morris, and Albert, Kathryn. “Projective
Methods in the Study of Personality.”
Findley, Warren G. “Measurement of Psychoeducational Growth.”

Lorraine Bouthilet.





Fleming, Virginia. “A Study of the Subtests in the Revised Stanford-Binet L and
M.” Journal of Genetic Psychology, LXIV (1944), 3-36.

The author systematically investigates the problem of whether or not the subtests
in the Revised Stanford-Binet tests were appropriately placed in reference to
difficulty. She submitted 210 Form L and 118 Form M Stanford-Binets to three
methods of analysis: (1) calculation of percentage of successes on each subtest, (2) 
calculation of the critical ratio for the difference in percentage of the same mental 
age group passing each subtest, (3) a refinement of the second method, where cases 
were selected only if they had taken every subtest within a year level. Using the 
first method she found that there are several levels where the subtests are of unequal 
difficulty. However, in comparing her results to those of Barber, who conducted a 
similar study, she found that they do not agree as to which are of unequal difficulty. 
The last two more refined methods of analysis showed that in general subtests within 
the same level were of equal difficulty. Miriam D. Rotman. 













Greenwood, Edward D., Snider, Hervon L. and Senti, Milton M. “Correlation Between
the Wechsler Mental Ability Scale, Form B, and Kent Emergency Test
(E-G-Y) Administered to Army Personnel.” American Journal of Orthopsychiatry,
XIV (1944), 171-173.

Two hundred maladjusted army men were given the Wechsler Mental Ability
Scale, Form B, and the Kent Emergency Test, E-G-Y. As a result, a coefficient
of correlation of .74 ± .02 was found between the Total Standard Score I.Q. of the
Wechsler Mental Ability Scale and the Kent Emergency Test I.Q. Allowing for the
abnormal group of men to whom the tests were given, this correlation was considered
high. The authors concluded that the Kent Emergency Test was a suitable
intelligence test in situations which do not permit more extensive testing. Catherine
Anne McNally.


Gulliksen, Harold. “A Course in the Theory of Mental Tests.” Psychometrika, 

VIII (1943), 223-245. 

An outline for a course in test theory is presented, together with a list of 
assignments, problems, and a bibliography. The course has been given in the 
Psychology Department of the University of Chicago. The material is presented 
in outline form at the present time because of the increased need for training in test 
theory due to the increase in the use of psychological tests for classification of mili- 
tary personnel, and because much of the material in such a course must be selected 
from a wide array of articles in the literature. This material is presented in order 
that an organized body of material for instructional purposes may be readily avail- 
able to those interested. (Courtesy Psychometrika.) 








Hunt, William A., Wittson, Cecil L. and Harris, Herbert I. “The Screen Test in 

Military Selection.” Psychological Review, LI (1944), 37-46. 

The authors compare the various paper-and-pencil psychological tests and the 
psychiatric interview as used in the pre-induction screening process. They find the 
psychiatric interview method more flexible, more inclusive and easier to administer 
from the mechanical viewpoint; psychological test procedure is more economical 
both in manpower and time and is better standardized and more objective. As yet 
there is no final check as to which is the preferable procedure, but it is their belief 
that a good psychiatrist is a better screening instrument than a good test and a good 
test is better than a poor psychiatrist. Miriam D. Rotman. 





Jurgensen, Clifford E. “A Nomograph for Rapid Determination of Medians.” 

Psychometrika, VIII (1943), 265-269. 

Directions are given for constructing a very simple nomograph for computing 
medians, which is entered with information from the cumulative frequency dis- 
tribution. It gives a linear interpolation within the class interval containing the 
median. (Courtesy Psychometrika.) 
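
The computation that such a nomograph replaces is the usual linear interpolation within the class interval containing the median. A minimal Python sketch follows, assuming equal-width intervals given by their exact lower limits; the function and argument names are illustrative and do not come from the article.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median of a grouped distribution by linear interpolation in the median class."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("the distribution contains no cases")
    half = total / 2
    cumulative = 0  # cases falling below the current class interval
    for lower, freq in zip(lower_limits, frequencies):
        if freq > 0 and cumulative + freq >= half:
            # Fraction of the way into this interval needed to reach N/2 cases.
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("lower_limits and frequencies do not match")


# Three intervals of width 10 starting at 0, 10, and 20.
print(grouped_median([0, 10, 20], 10, [1, 2, 1]))
```

In the example, half of the four cases is two; one case lies below the second interval, so the median falls halfway through it.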





Thornton, G. R. “The Significance of Rank Difference Coefficients of Correlation.”
Psychometrika, VIII (1943), 211-222.

The coefficients of rank difference correlation that are barely significant at six
different levels of significance are given for N’s of 2 to 30. Most of the values were
obtained by translation of Olds’ tables of probabilities for various values of $\sum d^2$.
Comparison of these data with those obtained by four other methods indicates that
one method yields values more appropriate than those obtained from Olds’ data for
coefficients significant at the .01 level for N’s from 11 to 25. This method also
provides a convenient means of obtaining approximate values of coefficients significant
at the .01 level for N’s above 30. Need for caution in evaluating the significance
of coefficients obtained from data involving tie rankings is indicated. The
article concludes with recommendations as to choice of methods of determining the
significance of rank difference coefficients. (Courtesy Psychometrika.)
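
For reference, the rank difference coefficient whose significance is tabled is computed from the sum of squared rank differences as 1 - 6Σd²/[N(N² - 1)]. The Python sketch below uses illustrative names and assumes untied ranks, in keeping with the caution about tie rankings noted in the abstract.

```python
def rank_difference_correlation(ranks_x, ranks_y):
    """Rank difference coefficient: 1 - 6 * sum(d^2) / (N * (N^2 - 1)); ranks untied."""
    n = len(ranks_x)
    if n != len(ranks_y) or n < 2:
        raise ValueError("need two rank lists of equal length, N >= 2")
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


# Five cases ranked on two variables; the rank differences are -1, 1, -1, 1, 0.
rho = rank_difference_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rho, 2))
```

The value obtained would then be compared with the tabled minimum for the given N and level of significance.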





Tinker, Miles A. “Speed, Power, and Level in the Revised Minnesota Paper Form
Board Test.” Journal of Genetic Psychology, LXIV (1944), 93-97.

The Revised Minnesota Paper Form Board Test was administered to 103 university
sophomores to analyze the relationship of the work limit method of measurement
resulting in “speed” scores to the unlimited time method resulting in “level”
scores and the relation of both of these to standard time method resulting in “power”
scores. Speed and level scores were found to vary independently. A major proportion
of the power score was accounted for by speed and level, with speed contributing
relatively more to the power score than level. This study indicates only a slight
correlation between intelligence and the Revised Paper Form Board Test. Catherine
Anne McNally.


Wherry, Robert J. and Gaylord, Richard H. “The Concept of Test and Item Reli- 

ability in Relation to Factor Pattern.” Psychometrika, VIII (1943), 247-264. 

It is shown that approaches other than the internal consistency method of
estimating test reliability are either less satisfactory or lead to the same general
results. The commonly attendant assumption of a single factor throughout the test
items is challenged, however. The consideration of a test made up of K sub-tests
each composed of a different orthogonal factor disclosed that the assumption of a
single factor produced an erroneous estimate of reliability with a ratio of
$(n-K)/(n-1)$ to the correct estimate. Special difficulties arising from this error
in application of current techniques to short tests or to test batteries are
discussed. Application of this same multi-factor concept to item-analysis discloses
similar difficulties in that field. The item-test coefficient approaches
$\sqrt{1/K}$ as an upper limit rather than 1.00 and approaches $\sqrt{1/n}$ as a
lower limit rather than .00. This latter finding accounts for an over-estimation
error in the Kuder-Richardson formula (8). A new method of isolating sub-tests
based upon the item-test coefficient is proposed and tentatively outlined. Either
this new method or a complete factor analysis is regarded as the only proper
approach to the problem of test reliability, and the item-sub-test coefficient is
similarly recommended as the proper approach for item analysis. (Courtesy
Psychometrika.)
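
As a rough numerical illustration of the ratio and the two limits quoted in the abstract above, the following Python sketch evaluates them for a hypothetical test. The function names and the example values (20 items, 4 sub-tests) are assumptions made here for illustration and do not come from Wherry and Gaylord's paper. The abstract does not define n, so it is assumed here to denote the number of items in the test.

```python
import math


def single_factor_ratio(n: int, k: int) -> float:
    """Ratio (n - K)/(n - 1) of the single-factor reliability estimate to the
    correct estimate, as quoted in the abstract (n assumed to be item count)."""
    if n <= 1 or not 1 <= k <= n:
        raise ValueError("need n > 1 and 1 <= k <= n")
    return (n - k) / (n - 1)


def item_test_limits(n: int, k: int) -> tuple[float, float]:
    """Return (lower, upper) limits of the item-test coefficient:
    sqrt(1/n) and sqrt(1/K), as quoted in the abstract."""
    if n < 1 or not 1 <= k <= n:
        raise ValueError("need n >= 1 and 1 <= k <= n")
    return math.sqrt(1 / n), math.sqrt(1 / k)


# Hypothetical example: 20 items arranged in 4 orthogonal sub-tests.
ratio = single_factor_ratio(20, 4)
lower, upper = item_test_limits(20, 4)
print(f"ratio = {ratio:.3f}, limits = [{lower:.3f}, {upper:.3f}]")
# prints: ratio = 0.842, limits = [0.224, 0.500]
```

With K = 1 the ratio equals 1, so the single-factor estimate agrees with the correct one. As K grows for a fixed n, the ratio falls, which is why the abstract singles out short tests and test batteries.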











Lt. Comdr. C. L. Wittson, U.S.N.R., Lt. Comdr. W. A. Hunt, U.S.N.R., Lt. (j.g.)
H. J. Older, U.S.N.R. “The Use of the Multiple Choice Group Rorschach
Test in Military Screening.” Journal of Psychology, XVII (1944), 91-94.
The Harrower-Erickson Multiple Choice Group Rorschach Test was tried out
on a sample population, consisting of three groups, at the U. S. Naval Training
Station, Newport, R. I. The test was unsuccessful in avoiding “false positives,”
picking up 44 per cent of the group of 417 normal subjects. It was unsatisfactory
in picking out true positives, culling only 59 per cent of a group of 235 men
“discharged as unfit for Naval service for neuropsychiatric reasons.” Furthermore,
of a group of 181 subjects previously “admitted to the observation ward for careful
study but finally adjudged fit for service,” the test picked up 59 per cent as belonging
to the abnormal group. The authors found the test “unsuitable in its present stage
of development for military selection.” Ralph J. Slattery.
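
The percentages reported above can be turned into approximate counts and simple screening indices. The Python sketch below is a minimal illustration and not the authors' analysis. The function and variable names are made up here, the counts are only rounded back from the quoted percentages, and the 5 per cent base rate used for the predictive value is purely hypothetical.

```python
def approx_count(group_size: int, percent_flagged: float) -> int:
    """Approximate number of subjects flagged, rounded from a reported percentage."""
    return round(group_size * percent_flagged / 100)


def positive_predictive_value(sensitivity: float,
                              false_positive_rate: float,
                              base_rate: float) -> float:
    """Share of flagged subjects who are truly unfit (Bayes' rule),
    for an assumed base rate of unfit men among those screened."""
    true_pos = sensitivity * base_rate
    false_pos = false_positive_rate * (1 - base_rate)
    return true_pos / (true_pos + false_pos)


# Group sizes and percentages as quoted in the abstract.
normal_flagged = approx_count(417, 44)       # normal subjects picked up
discharged_flagged = approx_count(235, 59)   # discharged men picked up
observed_flagged = approx_count(181, 59)     # observation-ward men picked up

sensitivity = 0.59            # true positives among the discharged group
false_positive_rate = 0.44    # false positives among the normal group

# Youden's index: 0 means no better than chance, 1 means perfect separation.
youden_j = sensitivity - false_positive_rate

# Hypothetical base rate of 5 per cent; not a figure from the study.
ppv = positive_predictive_value(sensitivity, false_positive_rate, base_rate=0.05)

print(normal_flagged, discharged_flagged, observed_flagged)
print(round(youden_j, 2), round(ppv, 3))
```

The small gap between the two rates is one way to read the authors' conclusion: a test that flags 44 per cent of normal men and only 59 per cent of unfit men separates the groups weakly, and under any low base rate most flagged men would be fit.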








































NEWS NOTES* 


An institute on student personnel work was held on the Los Angeles Campus 
of the University of California during the week beginning July 24, in connection 
with the 1944 Summer Session. 

The institute was designed to help colleges and universities of the western 
states in the evaluation and development of student personnel services. It was 
planned in collaboration with Western Personnel Service, itself a cooperative associa- 
tion of western colleges and universities formed to work together on student per- 
sonnel problems. The Academic Council of Western Personnel Service, under the 
chairmanship of Dean Karl Onthank of the University of Oregon, assisted Wini- 
fred Hausam, Director, and Helen Fisk, Associate Director, in the preparation of 
the program. 

Leader of the institute was Dr. E. G. Williamson, Dean of Students, University 
of Minnesota; President of the American College Personnel Association; Chairman 
of the Student Personnel Committee of the American Council on Education. During 
the war, Dr. Williamson has been chairman of the Advisory Committee to the 
United States Armed Forces Institute; chairman of the Committee on Training of 
the Commission on Vocational Counseling of Veterans, War Manpower Commission; 
and consultant to the Adjutant General’s Department concerning counseling of 
soldiers as part of the demobilization program. 





Lt. Hugh M. Bell, A.G.D., is stationed at the Ninth Service Command Special 
Training Center, Camp McQuaide, California. The work of psychologists at the 
Special Training Center is described in an article by Lt. Bell and Lt. Altus in the 
Psychological Bulletin, March, 1944. 





Francis F. Bradshaw is Dean of the College for War Training at the University 
of North Carolina, Chapel Hill, North Carolina. 





Lucile B. Brown, formerly on the personnel staff at Northwestern University 
and later with Sears, Roebuck and Company, is overseas with the American Red 
Cross. For a time Mrs. Brown was in England and then was sent to North Africa. 
Her address is American Red Cross, A.P.O. 763, in care of Postmaster, New York 
City. 





R. K. Compton, formerly Dean of the Division of General Science and Chairman 
of the Department of Psychology at South Dakota State College, is now Personnel 
Director, Hastings Manufacturing Co., Hastings, Michigan. 





The new address of Lt. Wilbur S. Gregory, Guidance Consultant and Instructor
in Psychology at the University of Nebraska, is Research Division, A.A.F.
Instructors School (Flexible Gunnery), Laredo Army Air Field, Laredo, Texas.





Elias Lyman, Chairman of the Board of Personnel Administration at Northwestern
University, resigned on May first and has returned to his old home at
Lincoln, Vermont. F. George Seulberger, Professor of Cooperative Education and
Chairman of the Department of Industrial Relations in the Northwestern
Technological Institute, has been appointed Dean of Students, replacing Mr. Lyman.


* News items concerning members of the American College Personnel Association
should be sent to Grace E. Manson, Northwestern University, Evanston,
Illinois.





Lt. James A. McClintock, U.S.N.R., is on leave as Director of Personnel and 
Professor of Psychology at Brothers College, Drew University, and is stationed at 
State College, Pennsylvania. 





Lt. (j.g.) Dewey B. Stuit, U.S.N.R., Associate Professor of Psychology at the 
University of Iowa (on leave), is Executive Officer, Naval Unit, Stevens Institute 
of Technology. 





Frances O. Triggs, previously on the staff of the Personnel Bureau, University 
of Illinois, is now with the Social Security Board, Washington, D. C. 





Ensign A. C. Van Dusen, U.S.N.R., formerly active in personnel work at the 
University of Florida, is now stationed at the Aerial Free Gunnery Central Standard- 
ization Committee, N.A.O.T.C., Naval Air Station, Jacksonville, Florida. 





Mrs. Ada S. Westover, who has been Assistant Dean of Women at the Univer- 
sity of Nebraska for the past ten years, has resigned to enter the field of medical 
social work in Cleveland, Ohio. Mrs. Westover had charge of the part-time em- 
ployment of women students in addition to her work as Counselor. 





Lt. C. Gilbert Wrenn, U.S.N.R., is acting as Secretary of the New Military 
Section of the American Association for Applied Psychology. Lt. T. Ernest New- 
land, U.S.N.R., also a member of the Section, attended the recent meeting held 
in Washington, D. C. 




















ANNOUNCEMENT 


Following an almost unanimous vote by the membership 
to the effect that the nominating ballot should be considered 
as the final voting ballot, the following American College Per- 
sonnel Association officers were designated for the year 
1944-45: 

President: E. G. Williamson,* Dean of Students, University of Minnesota,
    Minneapolis, Minnesota
Vice-Pres.: D. D. Feder, Bureau of Naval Personnel, Washington, D. C.
Secretary: Thelma Mills,† Director of Student Affairs for Women, University of
    Missouri, Columbia, Missouri
Treasurer: W. W. Blaesser,* Administrative Secretary of the Personnel Council,
    University of Wisconsin, Madison 6, Wisconsin
Members-at-Large of the Executive Council:
    J. L. Bergstresser, Dean of Students, College of the City of New York.
    A. J. Brumbaugh, Dean of Students, University of Chicago, Chicago, Illinois.
    Helen G. Fisk, Associate Director, Western Personnel Service, Pasadena,
        California.
    Robert Hoppock, Professor of Education, New York University, Washington
        Square, New York, New York.
    Esther Lloyd-Jones, Professor of Education, Teachers College, Columbia
        University, New York, New York.


* Serving the second year of a two-year term.
† The secretary is elected for a two-year term (1944-46).













